arXiv:1507.05952v3 [cs.DS] 8 Dec 2015 


Optimal Testing for Properties of Distributions 


Jayadev Acharya* 
EECS, MIT 

jayadev@csail.mit.edu


Constantinos Daskalakis    Gautam Kamath

EECS, MIT EECS, MIT 

costis@mit.edu g@csail.mit.edu 

December 9, 2015 


Abstract 

Given samples from an unknown distribution p, is it possible to distinguish whether p belongs 
to some class of distributions C versus p being far from every distribution in C? This fundamental
question has received tremendous attention in statistics, focusing primarily on asymptotic
analysis, and more recently in information theory and theoretical computer science, where the 
emphasis has been on small sample size and computational complexity. Nevertheless, even for 
basic properties of distributions such as monotonicity, log-concavity, unimodality, independence, 
and monotone-hazard rate, the optimal sample complexity is unknown. 

We provide a general approach via which we obtain sample-optimal and computationally 
efficient testers for all these distribution families. At the core of our approach is an algorithm 
which solves the following problem: Given samples from an unknown distribution p, and a
known distribution q, are p and q close in $\chi^2$-distance, or far in total variation distance? With
this tool in place, we develop a general testing framework which leads to the following results: 

• Testing identity to any distribution over [n] requires $\Theta(\sqrt{n}/\varepsilon^2)$ samples. This is optimal
for the uniform distribution. This gives an alternate argument for the minimax sample
complexity of testing identity (proved in [VV14]).

• For all $d \ge 1$ and n sufficiently large, testing whether a discrete distribution over $[n]^d$
is monotone requires an optimal $\Theta(n^{d/2}/\varepsilon^2)$ samples. The single-dimensional version
of our theorem improves a long line of research starting with [BKR04], where the previous
best tester required $\Omega(\sqrt{n}\log(n)/\varepsilon^4)$ samples, while the high-dimensional version
improves [BFRV11], which requires $\tilde{\Omega}(n^{d-1/2}\,\mathrm{poly}(1/\varepsilon))$ samples.

• For all $d \ge 1$, testing whether a collection of random variables over $[n_1] \times \cdots \times [n_d]$ are
independent requires $O\big(\big((\prod_{l=1}^{d} n_l)^{1/2} + \sum_{l=1}^{d} n_l\big)/\varepsilon^2\big)$ samples. A lower bound of
$\Omega\big((\prod_{l=1}^{d} n_l)^{1/2}/\varepsilon^2\big)$ samples is also proved. This result extends the known results for
testing independence to more than two random variables. For the special case of d = 2, when
$n_1 = n_2 = n$, this improves the results of [BFF+01] to the optimal $\Theta(n/\varepsilon^2)$ sample complexity.

• Testing whether a discrete distribution over [n] is log-concave requires an optimal $\Theta(\sqrt{n}/\varepsilon^2)$
samples. The same is true for testing whether a distribution has a monotone hazard rate,
and testing whether it is unimodal.

The optimality of our testers is established by providing matching lower bounds with respect to 
both n and $\varepsilon$. Finally, a necessary building block for our testers and an important byproduct of
our work are the first known computationally efficient proper learners for discrete log-concave 
and monotone hazard rate distributions. 
0953960 (CAREER) and CCF-1101491.
1 Introduction


The quintessential scientific question is whether an unknown object has some property, i.e. whether 
a model from a specific class fits the object’s observed behavior. If the unknown object is a 
probability distribution, p, to which we have sample access, we are typically asked to distinguish 
whether p belongs to some class C or whether it is sufficiently far from it. 

This question has received tremendous attention in the field of statistics (see, e.g., [Fis25, LR06]),
where test statistics for important properties such as the ones we consider here have been proposed. 
Nevertheless, the emphasis has been on asymptotic analysis, characterizing the rates of convergence 
of test statistics under null hypotheses, as the number of samples tends to infinity. In contrast, we 
wish to study the following problem in the small sample regime: 


$\Pi(C, \varepsilon)$: Given a family of distributions C, some $\varepsilon > 0$, and sample access to an
unknown distribution p over a discrete support, how many samples are required
to distinguish between $p \in C$ versus $d_{\mathrm{TV}}(p, C) > \varepsilon$?


The problem has been studied intensely in the literature on property testing and sublinear 
algorithms [Gol98, Fis01, Rub06, Ron08, Can15], where the emphasis has been on characterizing
the optimal tradeoff between p’s support size and the accuracy $\varepsilon$ in the number of samples. Several
results have been obtained, roughly clustering into three groups, where (i) C is the class of monotone
distributions over [n], or more generally a poset [BKR04, BFRV11]; (ii) C is the class of independent,
or k-wise independent, distributions over a hypergrid [BFF+01, AAK+07]; and (iii) C contains a
single distribution q, and the problem becomes that of testing whether p equals q or is far from
it [BFF+01, Pan08, VV14].

With respect to (iii), [VV14] exactly characterizes the number of samples required to test
identity to each distribution q, providing a single tester matching this bound simultaneously for all 
q. Nevertheless, this tester and its precursors are not applicable to the composite identity testing 
problem that we consider. If our class C were finite, we could test against each element in the class, 
albeit this would not necessarily be sample optimal. If our class C were a continuum, we would 
need tolerant identity testers, which tend to be more expensive in terms of sample complexity 
[VV11], and result in substantially suboptimal testers for the classes we consider. Or we could use
approaches related to the generalized likelihood ratio test, but their behavior is not well-understood in
our regime, and optimizing likelihood over our classes becomes computationally intense. 


Our Contributions. In this paper, we obtain sample-optimal and computationally efficient testers
for $\Pi(C, \varepsilon)$ for the most fundamental shape restrictions to a distribution. Our contributions are the
following: 

1. For a known distribution q over [n], and given samples from an unknown distribution p, we
show that distinguishing the cases: (a) whether the $\chi^2$-distance between p and q is at most
$\varepsilon^2/2$, versus (b) the $\ell_1$ distance between p and q is at least $\varepsilon$, requires $\Theta(\sqrt{n}/\varepsilon^2)$ samples. As
a corollary, we provide a simpler argument to show that identity testing requires $\Theta(\sqrt{n}/\varepsilon^2)$
samples (previously shown in [VV14]).

2. For the class $C = \mathcal{M}_n^d$ of monotone distributions over $[n]^d$ we require an optimal $\Theta(n^{d/2}/\varepsilon^2)$
number of samples, where prior work requires $\Omega(\sqrt{n}\log n/\varepsilon^4)$ samples for d = 1 and
$\tilde{\Omega}(n^{d-1/2}\,\mathrm{poly}(1/\varepsilon))$ for d > 1 [BKR04, BFRV11]. Our results improve the exponent of n with
respect to d, shave all logarithmic factors in n, and improve the exponent of $\varepsilon$ by at least a factor of 2.


(a) A useful building block and interesting byproduct of our analysis is extending Birgé’s
oblivious decomposition for single-dimensional monotone distributions [Bir87] to monotone
distributions in $d \ge 1$, and to the stronger notion of $\chi^2$-distance. See Section C.1.

(b) Moreover, we show that $O(\log^d n)$ samples suffice to learn a monotone distribution over
$[n]^d$ in $\chi^2$-distance. See Lemma 5 for the precise statement.


3. For the class $C = \Pi_d$ of product distributions over $[n_1] \times \cdots \times [n_d]$, our algorithm requires
$O\big(\big((\prod_{l} n_l)^{1/2} + \sum_{l} n_l\big)/\varepsilon^2\big)$ samples. We note that a product distribution is one where all
marginals are independent, so this is equivalent to testing if a collection of random variables
are all independent. In the case where the $n_l$’s are large, the first term dominates, and
the sample complexity is $O\big((\prod_{l} n_l)^{1/2}/\varepsilon^2\big)$. In particular, when d is a constant and all $n_l$’s
are equal to n, we achieve the optimal sample complexity of $\Theta(n^{d/2}/\varepsilon^2)$. To the best of our
knowledge, this is the first result for $d \ge 3$, and when d = 2, this improves the previously
known complexity from $O\big(\frac{n}{\varepsilon^6}\,\mathrm{polylog}(n/\varepsilon)\big)$ [BFF+01, LRR13], significantly improving the
dependence on $\varepsilon$ and shaving all logarithmic factors.


4. For the classes $C = \mathcal{LCD}_n$, $C = \mathcal{MHR}_n$ and $C = \mathcal{U}_n$ of log-concave, monotone-hazard-rate
and unimodal distributions over [n], we require an optimal $\Theta(\sqrt{n}/\varepsilon^2)$ number of samples. Our
testers for $\mathcal{LCD}_n$ and $\mathcal{MHR}_n$ are to our knowledge the first for these classes for the low
sample regime we are studying; see [HVK05] and its references for statistics literature on the
asymptotic regime. Our tester for $\mathcal{U}_n$ improves the dependence of the sample complexity on
$\varepsilon$ by at least a factor of 2 in the exponent, and shaves all logarithmic factors in n, compared
to testers based on testing monotonicity.

(a) A useful building block and important byproduct of our analysis are the first computationally
efficient algorithms for properly learning log-concave and monotone-hazard-rate
distributions, to within $\varepsilon$ in total variation distance, from poly$(1/\varepsilon)$ samples, independent
of the domain size n. See Corollaries 5 and 7. Again, these are the first computationally
efficient algorithms to our knowledge in the low sample regime. [ADLS15, CDSS14]
provide algorithms for density estimation, which are non-proper, i.e. will approximate
an unknown distribution from these classes with a distribution that does not belong to
these classes. On the other hand, the statistics literature focuses on maximum-likelihood
estimation in the asymptotic regime; see e.g. [CS10] and its references.

5. For all the above classes we obtain matching lower bounds, showing that the sample complexity
of our testers is optimal with respect to n, $\varepsilon$ and, when applicable, d. See Section 10. Our
lower bounds are based on extending Paninski’s lower bound for testing uniformity [Pan08].

At the heart of our tester lies a novel use of the $\chi^2$ statistic. Naturally, the $\chi^2$ and its related
$\ell_2$ statistic have been used in several of the afore-cited results. We propose a new use of the $\chi^2$
statistic enabling our optimal sample complexity. The essence of our approach is to first draw a
small number of samples (independent of n for log-concave and monotone-hazard-rate distributions
and only logarithmic in n for monotone and unimodal distributions) to approximate the unknown
distribution p in $\chi^2$ distance. If $p \in C$, our learner is required to output a distribution q that is
$O(\varepsilon)$-close to C in total variation and $O(\varepsilon^2)$-close to p in $\chi^2$ distance. Then some analysis reduces
our testing problem to distinguishing the following cases:

• p and q are $O(\varepsilon^2)$-close in $\chi^2$ distance; this case corresponds to $p \in C$.

• p and q are $\Omega(\varepsilon)$-far in total variation distance; this case corresponds to $d_{\mathrm{TV}}(p, C) \ge \varepsilon$.

We draw a comparison with robust identity testing, in which one must distinguish whether p
and q are $c_1\varepsilon$-close or $c_2\varepsilon$-far in total variation distance, for constants $c_2 > c_1 > 0$. In [VV11],
Valiant and Valiant show that $\Omega(n/\log n)$ samples are required for this problem, a nearly-linear
sample complexity, which may be prohibitively large in many settings. In comparison, the problem
we study tests for $\chi^2$ closeness rather than total variation closeness: a relaxation of the previous
problem. However, our tester demonstrates that this relaxation allows us to achieve a substantially
sublinear complexity of $O(\sqrt{n}/\varepsilon^2)$. On the other hand, this relaxation is still tight enough to be
useful, demonstrated by our application in obtaining sample-optimal testers.

We note that while the $\chi^2$ statistic for testing hypotheses is prevalent in statistics, providing
optimal error exponents in the large-sample regime, to the best of our knowledge, in the small-sample
regime, modified versions of the $\chi^2$ statistic have only been recently used for closeness
testing in [ADJ+12, CDVV14] and for testing uniformity of monotone distributions in [AJOT13].
In particular, [ADJ+12] design an unbiased statistic for estimating the $\chi^2$ distance between two
unknown distributions.

In Section 4 we show that a version of the $\chi^2$ statistic, appropriately excluding certain elements
of the support, is sufficiently well-concentrated to distinguish between the above cases. Moreover,
the sample complexity of our algorithm is optimal for most classes. Our base tester is combined with
the afore-mentioned extension of Birgé’s decomposition theorem to test monotone distributions in
Section 5 (see Theorem 3 and Corollary 1), and is also used to test independence of distributions
in Section 7 (see Theorem 6).

Naturally, there are several bells and whistles that we need to add to the above skeleton to 
accommodate all classes of distributions that we are considering. For log-concave and monotone-hazard
distributions, we are unable to obtain a cheap (in terms of samples) learner that $\chi^2$-approximates
the unknown distribution p throughout its support. Still, we can identify a subset
of the support where the $\chi^2$-approximation is tight and which captures almost all the probability
mass of p. We extend our tester to accommodate excluding subsets of the support from the
$\chi^2$-approximation. See Theorems 7 and 8 in Sections 8 and 9.

For unimodal distributions, we are even unable to identify a large enough subset of the support
where the $\chi^2$ approximation is guaranteed to be tight. But we can show that there exists a light
enough piece of the support (in terms of probability mass under p) that we can exclude to make the
$\chi^2$ approximation tight. Given that we only use Chebyshev’s inequality to prove the concentration
of the test statistic, it would seem that our lack of knowledge of the piece to exclude would involve
a union bound and a corresponding increase in the required number of samples. We avoid this
through a careful application of Kolmogorov’s max inequality in our setting. See Theorem 5 of
Section 6.

Related Work. For the problems that we study in this paper, we have provided the related
works in the previous section along with our contributions. We cannot do justice to the role of
shape restrictions of probability distributions in probabilistic modeling and testing. It suffices to say
that the classes of distributions that we study are fundamental, motivating extensive literature on
their learning and testing [BBBB72]. In recent times, there has been work on shape restricted
statistics, pioneered by Jon Wellner and others. [JW09, BW10] study estimation of monotone and
k-monotone densities, and [BJR11, SW14] study estimation of log-concave distributions.

As we have mentioned, statistics has focused on the asymptotic regime as the number of samples 
tends to infinity. Instead we are considering the low sample regime and are more stringent about the 
behavior of our testers, requiring 2-sided guarantees. We want to accept if the unknown distribution
is in our class of interest, and also reject if it is far from the class. For this problem, as discussed
above, there are few results when C is a whole class of distributions. More closely related to our paper
is the line of papers [BKR04, ACS10, BFRV11] for monotonicity testing, albeit these papers have
sub-optimal sample complexity as discussed above. Testing independence of random variables has
a long history in statistics [RS81, AK11]. The theoretical computer science community has also
considered the problem of testing independence of two random variables [BFF+01, LRR13]. While
our results sharpen the case where the variables are over domains of equal size, they demonstrate
an interesting asymmetric upper bound when this is not the case. More recently, Acharya and
Daskalakis provide optimal testers for the family of Poisson Binomial Distributions [AD15].

Finally, contemporaneous work of Canonne et al. [CDGR15a, CDGR15b] provides a generic
algorithm and lower bounds for the single-dimensional families of distributions considered here.
We note that their algorithm has a sample complexity which is suboptimal in both n and $\varepsilon$, while
our algorithms are optimal. Their algorithm also extends to mixtures of these classes, though some
of these extensions are not computationally efficient. They also provide a framework for proving
lower bounds, giving the optimal bounds for many classes when $\varepsilon$ is sufficiently large with respect
to 1/n. In comparison, we provide these lower bounds unconditionally by modifying Paninski’s
construction [Pan08] to suit the classes we consider.


2 Preliminaries 


We use the following probability distances in our paper. 

Definition 1. The total variation distance between distributions p and q is defined as

$$d_{\mathrm{TV}}(p, q) \overset{\mathrm{def}}{=} \sup_{A} |p(A) - q(A)| = \frac{1}{2}\|p - q\|_1.$$

For a subset of the domain, the total variation distance is defined as half of the $\ell_1$ distance
restricted to the subset.

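Definition 2. The $\chi^2$-distance between p and q over [n] is defined by

$$\chi^2(p, q) \overset{\mathrm{def}}{=} \sum_{i \in [n]} \frac{(p_i - q_i)^2}{q_i} = \left[\sum_{i \in [n]} \frac{p_i^2}{q_i}\right] - 1.$$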


Definition 3. The Kolmogorov distance between two probability measures p and q over an ordered
set (e.g., $\mathbb{R}$) with cumulative distribution functions (CDFs) $F_p$ and $F_q$ is defined as

$$d_{\mathrm{K}}(p, q) \overset{\mathrm{def}}{=} \sup_{x} |F_p(x) - F_q(x)|.$$


Our paper is primarily concerned with testing against classes of distributions, defined formally 
as follows: 

Definition 4. Given $\varepsilon \in (0, 1]$ and sample access to a distribution p, an algorithm is said to test
a class C if it has the following guarantees:

• If $p \in C$, the algorithm outputs Accept with probability at least 2/3;

• If $d_{\mathrm{TV}}(p, C) \ge \varepsilon$, the algorithm outputs Reject with probability at least 2/3.

The Dvoretzky-Kiefer-Wolfowitz (DKW) inequality gives a generic algorithm for learning any
distribution with respect to the Kolmogorov distance [DKW56].

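Lemma 1. (See [DKW56], [Mas90]) Suppose we have n i.i.d. samples $X_1, \ldots, X_n$ from a distribution
with CDF F. Let $F_n(x) \overset{\mathrm{def}}{=} \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}\{X_i \le x\}$ be the empirical CDF. Then
$\Pr[d_{\mathrm{K}}(F, F_n) \ge \varepsilon] \le 2e^{-2n\varepsilon^2}$. In particular, if $n = \Omega((1/\varepsilon^2) \cdot \log(1/\delta))$, then $\Pr[d_{\mathrm{K}}(F, F_n) \ge \varepsilon] \le \delta$.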
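As a small illustration of Lemma 1, the sketch below builds an empirical CDF over the domain {0, ..., domain_size - 1} and solves $2e^{-2n\varepsilon^2} \le \delta$ for the number of samples. The function names `empirical_cdf` and `dkw_sample_size` are ours and do not appear in the paper; the code assumes samples are integer indices into the domain.

```python
import math

def empirical_cdf(samples, domain_size):
    """Empirical CDF over {0, ..., domain_size - 1}: F_n(x) = (1/n) * #{X_i <= x}."""
    counts = [0] * domain_size
    for x in samples:
        counts[x] += 1
    cdf, running = [], 0
    for c in counts:
        running += c
        cdf.append(running / len(samples))
    return cdf

def dkw_sample_size(eps, delta):
    """Smallest n with 2 * exp(-2 * n * eps^2) <= delta (the DKW tail bound)."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))
```

The sample size depends only on the accuracy and confidence parameters, not on the domain size, which is what makes Kolmogorov-distance learning cheap.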

We note the following useful relationships between these distances [GS02]:

Proposition 1. $d_{\mathrm{K}}(p, q)^2 \le d_{\mathrm{TV}}(p, q)^2 \le \frac{1}{4}\chi^2(p, q)$.
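To make the three distances concrete, here is a minimal sketch that computes them for distributions given as probability vectors over [n] and can be used to check the chain of inequalities in Proposition 1 numerically. The function names are ours; the handling of zero entries of q (returning infinity when $q_i = 0 < p_i$) is an assumption made for the sketch.

```python
import math

def total_variation(p, q):
    """Half of the l1 distance between p and q."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def chi_squared(p, q):
    """Sum of (p_i - q_i)^2 / q_i; infinite if q_i = 0 while p_i > 0."""
    total = 0.0
    for a, b in zip(p, q):
        if b == 0:
            if a != 0:
                return math.inf
            continue
        total += (a - b) ** 2 / b
    return total

def kolmogorov(p, q):
    """Largest absolute gap between the two CDFs."""
    best = cp = cq = 0.0
    for a, b in zip(p, q):
        cp += a
        cq += b
        best = max(best, abs(cp - cq))
    return best
```

For example, with p = (0.5, 0.3, 0.2) and q = (0.25, 0.25, 0.5), the total variation and Kolmogorov distances both come out to 0.3 and the $\chi^2$-distance to 0.44, so $0.09 \le 0.09 \le 0.11$ as Proposition 1 requires.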

In this paper, we will consider the following classes of distributions: 

• Monotone distributions over $[n]^d$ (denoted by $\mathcal{M}_n^d$), for which $i \preceq j$ implies $f_i \ge f_j$;¹

• Unimodal distributions over [n] (denoted by $\mathcal{U}_n$), for which there exists an $i^*$ such that $f_i$ is
non-decreasing for $i \le i^*$ and non-increasing for $i \ge i^*$;

• Log-concave distributions over [n] (denoted by $\mathcal{LCD}_n$), the sub-class of unimodal distributions
for which $f_{i-1} f_{i+1} \le f_i^2$;

• Monotone hazard rate (MHR) distributions over [n] (denoted by $\mathcal{MHR}_n$), for which $i < j$
implies that the hazard rate at i is at most the hazard rate at j.
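The one-dimensional class definitions above translate directly into membership checks on a vector f. The sketch below is only illustrative: the function names are ours, comparisons are exact (so integer or rational weights are safest), and the MHR check assumes the common discrete hazard rate $h_i = f_i / \sum_{j \ge i} f_j$, which may differ from the exact form used in the paper.

```python
def is_monotone(f):
    """Non-increasing over [n], matching the convention used in the text."""
    return all(f[i] >= f[i + 1] for i in range(len(f) - 1))

def is_unimodal(f):
    """Non-decreasing up to some mode, non-increasing afterwards."""
    mode = f.index(max(f))
    rising = all(f[i] <= f[i + 1] for i in range(mode))
    falling = all(f[i] >= f[i + 1] for i in range(mode, len(f) - 1))
    return rising and falling

def is_log_concave(f):
    """Unimodal and f[i-1] * f[i+1] <= f[i]^2 at every interior point."""
    interior = all(f[i - 1] * f[i + 1] <= f[i] ** 2 for i in range(1, len(f) - 1))
    return is_unimodal(f) and interior

def is_mhr(f):
    """ASSUMPTION: hazard rate h_i = f_i / sum_{j >= i} f_j must be non-decreasing."""
    tails = [0] * (len(f) + 1)
    for i in range(len(f) - 1, -1, -1):
        tails[i] = tails[i + 1] + f[i]
    # h_i <= h_{i+1} is checked by cross-multiplication to avoid division.
    return all(f[i] * tails[i + 1] <= f[i + 1] * tails[i]
               for i in range(len(f) - 1) if tails[i + 1] > 0)
```

The checks work on unnormalized weights as well, since each condition is invariant under scaling.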

Definition 5. An $\eta$-effective support of a distribution p is any set S such that $p(S) \ge 1 - \eta$.

The flattening of a function f over a subset S is the function $\bar{f}$ such that $\bar{f}_i = f(S)/|S|$ for $i \in S$.

Definition 6. Let p be a distribution, and suppose $I_1, \ldots$ is a partition of the domain. The flattening
of p with respect to $I_1, \ldots$ is the distribution $\bar{p}$ which is the flattening of p over the intervals $I_1, \ldots$.
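Flattening is simple to state in code. In this sketch (our own naming), a partition is represented as a list of half-open index ranges `(start, end)` covering `range(len(p))`; that representation is an assumption of the example, not something fixed by the paper.

```python
def flatten(p, intervals):
    """Replace p on each interval by the uniform distribution with the same total mass.

    intervals: list of half-open (start, end) index pairs partitioning range(len(p)).
    """
    flat = [0.0] * len(p)
    for start, end in intervals:
        mass = sum(p[start:end])
        for i in range(start, end):
            flat[i] = mass / (end - start)
    return flat
```

For instance, flattening (0.4, 0.2, 0.3, 0.1) over the intervals {1, 2} and {3, 4} gives (0.3, 0.3, 0.2, 0.2), up to floating-point rounding.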

Poisson Sampling. Throughout this paper, we use the standard Poissonization approach. Instead
of drawing exactly m samples from a distribution p, we first draw $m' \sim \mathrm{Poisson}(m)$, and then
draw $m'$ samples from p. As a result, the numbers of times different elements in the support of p
occur in the sample become independent, giving much simpler analyses. In particular, the number
of times we will observe domain element i will be distributed as $\mathrm{Poisson}(m p_i)$, independently for
each i. Since Poisson(m) is tightly concentrated around m, this additional flexibility comes only at
a sub-constant cost in the sample complexity, with an additive increase in the error probability
that is inversely exponential in m.
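The Poissonization step can be sketched with the standard library alone. The helper `poisson` below draws a Poisson variate by counting rate-1 exponential arrivals in [0, mean], which is exact but takes time linear in the mean; both helper names are ours, and a production implementation would more likely call a library Poisson sampler.

```python
import random

def poisson(mean, rng):
    """Count rate-1 exponential arrivals in [0, mean]; the count is Poisson(mean)."""
    count, t = 0, rng.expovariate(1.0)
    while t <= mean:
        count += 1
        t += rng.expovariate(1.0)
    return count

def poissonized_counts(p, m, rng):
    """Draw m' ~ Poisson(m), then m' i.i.d. samples from p; return per-element counts."""
    m_prime = poisson(m, rng)
    counts = [0] * len(p)
    for i in rng.choices(range(len(p)), weights=p, k=m_prime):
        counts[i] += 1
    return counts
```

The resulting count of element i is distributed as Poisson(m * p[i]), independently across elements, which is the property the analysis relies on.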

3 Overview 

Our algorithm for testing a distribution p can be decomposed into three steps. 

¹ This definition describes monotone non-increasing distributions. By symmetry, identical results hold for monotone
non-decreasing distributions.

Near-proper learning in $\chi^2$-distance. Our first step requires a learning algorithm with very
specific guarantees. In proper learning, we are given sample access to a distribution $p \in C$, where
C is some class of distributions, and we wish to output $q \in C$ such that p and q are close in total
variation distance. In our setting, given sample access to $p \in C$, we wish to output q such that q is
close to C in total variation distance, and p and q are close in $\chi^2$-distance on an effective support² of
p. From an information theoretic standpoint, this problem is harder than proper learning, since
$\chi^2$-distance is more restrictive than total variation distance. Nonetheless, this problem can be shown
to have comparable sample complexity to proper learning for the structured classes we consider in
this paper.

Computation of distance to class. The next step is to see if the hypothesis q is close to the 
class C or not. Since we have an explicit description of q, this step requires no further samples from 
p, i.e. it is purely computational. If we find that q is far from the class C, then it must be that 
$p \notin C$, as otherwise the guarantees from the previous step would imply that q is close to C. Thus,
if it is not, we can terminate the algorithm at this point. 

$\chi^2$-testing. At this point, the previous two steps guarantee that our distribution q is such that:

• If $p \in C$, then p and q are close in $\chi^2$ distance on a (known) effective support of p;

• If $d_{\mathrm{TV}}(p, C) \ge \varepsilon$, then p and q are far in total variation distance.

We can distinguish between these two cases using $O(\sqrt{n}/\varepsilon^2)$ samples with a simple statistical
$\chi^2$-test, which we describe in Section 4.

Using the above three-step approach, our tester, as described in the next section, can directly
test monotonicity, log-concavity, and monotone hazard rate. With an extra trick, using
Kolmogorov’s max inequality, it can also test unimodality.
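The three steps compose into a short control flow, sketched below. Everything here is a placeholder: `learn`, `distance_to_class`, and `chi2_test` are hypothetical callables standing in for the class-specific learner, the distance computation, and the $\chi^2$-test, and the $\varepsilon/2$ cutoff is our reading of the closeness requirement on q stated later in Theorem 2.

```python
def run_tester(learn, distance_to_class, chi2_test, eps):
    """Three-step tester skeleton (all three callables are hypothetical placeholders).

    learn()              -> explicit hypothesis q from the near-proper chi^2 learner
    distance_to_class(q) -> total variation distance from q to the class C
    chi2_test(q)         -> True if fresh samples from p look chi^2-close to q
    """
    q = learn()                          # step 1: uses samples from p
    if distance_to_class(q) > eps / 2:   # step 2: purely computational
        return "Reject"
    return "Accept" if chi2_test(q) else "Reject"  # step 3: uses fresh samples
```

Only steps 1 and 3 consume samples, which is why the overall sample complexity is the sum of the learner's cost and the $\chi^2$-test's cost.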

4 A Robust Identity Test 

Our main result in this section is Theorem 2. As an immediate corollary, we obtain the following
result on testing whether an unknown distribution is close in $\chi^2$ or far in $\ell_1$ distance to a known
distribution. In particular, we show the following:

Theorem 1. For a known distribution q, there exists an algorithm with sample complexity
$O(\sqrt{n}/\varepsilon^2)$ that distinguishes between the cases

• $\chi^2(p, q) < \varepsilon^2/10$ versus

• $\|p - q\|_1^2 \ge \varepsilon^2$

with probability at least 5/6.

This theorem follows from our main result of this section, stated next, slightly more generally 
for classes of distributions. 

² We also require the algorithm to output a description of an effective support for which this property holds. This
requirement can be slightly relaxed, as we show in our results for testing unimodality.

Theorem 2. Suppose we are given $\varepsilon \in (0, 1]$, a class of probability distributions C, sample access to
a distribution p over [n], and an explicit description of a distribution q with the following properties:

Property 1. $d_{\mathrm{TV}}(q, C) \le \frac{\varepsilon}{2}$.

Property 2. If $p \in C$, then $\chi^2(p, q) \le \frac{\varepsilon^2}{500}$.

Then there exists an algorithm with the following guarantees:


• If $p \in C$, the algorithm outputs Accept with probability at least 2/3;

• If $d_{\mathrm{TV}}(p, C) \ge \varepsilon$, the algorithm outputs Reject with probability at least 2/3.

The time and sample complexity of this algorithm are $O(\sqrt{n}/\varepsilon^2)$.


Remark 1. As stated in Theorem 2, Property 2 requires that q is $O(\varepsilon^2)$-close in $\chi^2$-distance to p
over its entire domain. For the class of monotone distributions, we are able to efficiently obtain
such a q, which immediately implies sample-optimal learning algorithms for this class. However, for
some classes, we cannot learn a q with such strong guarantees, and we must consider modifications
to our base testing algorithm.

For example, for log-concave and monotone hazard rate distributions, we can obtain a distribution
q and a set S with the following guarantees:

• If $p \in C$, then $\chi^2(p_S, q_S) \le O(\varepsilon^2)$ and $p(S) \ge 1 - O(\varepsilon)$;

• If $d_{\mathrm{TV}}(p, C) \ge \varepsilon$, then $d_{\mathrm{TV}}(p, q) \ge \varepsilon/2$.


In this scenario, the tester will simply pretend the support of p and q is S, ignoring any samples and
support elements in $[n] \setminus S$. Analysis of this tester is extremely similar to what we present below.
In particular, we can still show that the statistic Z will be separated in the two cases. When $p \in C$,
excluding $[n] \setminus S$ will only reduce Z. On the other hand, when $d_{\mathrm{TV}}(p, C) \ge \varepsilon$, since $p(S) \ge 1 - O(\varepsilon)$,
p and q must still be far on the remaining support, and we can show that Z is still sufficiently large.
Therefore, a small modification allows us to handle this case with the same sample complexity of
$O(\sqrt{n}/\varepsilon^2)$.

A further modification can handle even weaker learning guarantees. We could handle the previous
case because the tester “knows what we don’t know”: it can explicitly ignore the support over
which we do not have a $\chi^2$-closeness guarantee. A more difficult case is when there may be a low
measure interval hidden in our effective support, over which p and q have a large $\chi^2$-distance. While
we may have insufficient samples to reliably identify this interval, it may still have a large effect
on our statistic. A naive solution would be to consider a tester which tries all possible “guesses”
for this “bad” interval, but a union bound would incur an extra logarithmic factor in the sample
complexity. We manage to avoid this cost through a careful analysis involving Kolmogorov’s max
inequality, maintaining the $O(\sqrt{n}/\varepsilon^2)$ sample complexity even in this more difficult case.

Being more precise, we can handle cases where we can obtain a distribution q and a set of
intervals $S = \{I_1, \ldots, I_b\}$ with the following guarantees:

• If $p \in C$, then $p(S) \ge 1 - O(\varepsilon)$, $p(I_j) = \Theta(p(S)/b)$ for all $j \in [b]$, and there exists a set $T \subseteq [b]$
such that $|T| \ge b - t$ (for $t = O(1)$) and $\chi^2(p_R, q_R) \le O(\varepsilon^2)$, where $R = \cup_{j \in T} I_j$;

• If $d_{\mathrm{TV}}(p, C) \ge \varepsilon$, then $d_{\mathrm{TV}}(p, q) \ge \varepsilon/2$.

This allows us to additionally test against the class of unimodal distributions.

The tester requires that an effective support is divided into several intervals of roughly equal
measure. It computes our statistic over each of these intervals, and we let our statistic Z be the
sum of all but the largest t of these values. In the case when $p \in C$, Z will only become smaller by
performing this operation. We use Kolmogorov’s maximal inequality to show that Z remains large
when $d_{\mathrm{TV}}(p, C) \ge \varepsilon$. More details on this tester are provided in Section D.
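The "sum of all but the largest t" rule can be sketched as follows. The per-element term is the one used by the base $\chi^2$ statistic, $((N_i - m q_i)^2 - N_i)/(m q_i)$; the function name, the half-open interval representation, and skipping elements with $q_i = 0$ are assumptions of this sketch rather than details taken from the paper.

```python
def trimmed_statistic(counts, q, m, intervals, t):
    """Sum the per-interval chi^2 statistics, dropping the t largest ones.

    counts[i]: (Poissonized) number of occurrences of element i
    intervals: list of half-open (start, end) index pairs
    """
    per_interval = []
    for start, end in intervals:
        z = sum(((counts[i] - m * q[i]) ** 2 - counts[i]) / (m * q[i])
                for i in range(start, end) if q[i] > 0)
        per_interval.append(z)
    per_interval.sort()
    keep = max(0, len(per_interval) - t)  # guard against t exceeding the interval count
    return sum(per_interval[:keep])
```

Dropping the largest values can only decrease the statistic, which matches the observation above that the operation is harmless when p belongs to the class.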


Algorithm 1 Chi-squared testing algorithm

1: Input: $\varepsilon$; an explicit distribution q; (Poisson) m samples from a distribution p, where $N_i$
   denotes the number of occurrences of the ith domain element.
2: $A \leftarrow \{i : q_i \ge \varepsilon/50n\}$
3: $Z \leftarrow \sum_{i \in A} \frac{(N_i - m q_i)^2 - N_i}{m q_i}$
4: if $Z \le m\varepsilon^2/10$ then
5:     return Accept
6: else
7:     return Reject
8: end if
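A direct transcription of Algorithm 1 into Python might look like the sketch below. The inclusion threshold $q_i \ge \varepsilon/(50n)$ and the acceptance cutoff $m\varepsilon^2/10$ follow our reading of the listing; the function names and the list-based inputs are our own choices.

```python
def chi_squared_statistic(counts, q, m, eps):
    """Z = sum over A of ((N_i - m q_i)^2 - N_i) / (m q_i), with A = {i : q_i >= eps / (50 n)}."""
    n = len(q)
    threshold = eps / (50 * n)
    z = 0.0
    for n_i, q_i in zip(counts, q):
        if q_i >= threshold:  # light elements of q are excluded from the statistic
            z += ((n_i - m * q_i) ** 2 - n_i) / (m * q_i)
    return z

def chi_squared_test(counts, q, m, eps):
    """Accept when the statistic is at most m * eps^2 / 10, otherwise reject."""
    z = chi_squared_statistic(counts, q, m, eps)
    return "Accept" if z <= m * eps ** 2 / 10 else "Reject"
```

As a sanity check with q uniform over four elements, m = 100 and $\varepsilon$ = 0.5: counts of (25, 25, 25, 25) give Z = -4 and an Accept, while counts of (100, 0, 0, 0) give Z = 296 and a Reject. Subtracting $N_i$ in the numerator is what makes each term an unbiased estimate under Poisson sampling.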


Proof of Theorem 2: Theorem 2 is proven by analyzing Algorithm 1. As shown in the appendix, Z
has the following mean and variance:

$$\mathbb{E}[Z] = m \cdot \sum_{i \in A} \frac{(p_i - q_i)^2}{q_i} = m \cdot \chi^2(p_A, q_A) \tag{1}$$

$$\mathrm{Var}[Z] = \sum_{i \in A} \left[ 2\,\frac{p_i^2}{q_i^2} + 4m \cdot \frac{p_i \cdot (p_i - q_i)^2}{q_i^2} \right] \tag{2}$$

where by $p_A$ and $q_A$ we denote respectively the vectors p and q restricted to the coordinates in
A, and we slightly abuse notation when we write $\chi^2(p_A, q_A)$, as these do not then correspond to
probability distributions.
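Equations (1) and (2) follow from per-element Poisson moments: for $N \sim \mathrm{Poisson}(\lambda)$ and a constant $\mu$, the quantity $Y = (N - \mu)^2 - N$ has mean $(\lambda - \mu)^2$ and variance $2\lambda^2 + 4\lambda(\lambda - \mu)^2$; substituting $\lambda = m p_i$, $\mu = m q_i$ and dividing by $m q_i$ recovers the summands. The sketch below checks these two identities numerically by summing the Poisson pmf up to a cutoff; the function name and the cutoff `kmax` are ours.

```python
import math

def per_element_moments(lam, mu, kmax=200):
    """Mean and variance of Y = (N - mu)^2 - N for N ~ Poisson(lam), by summing the pmf."""
    pmf = math.exp(-lam)  # Pr[N = 0]
    mean = second = 0.0
    for k in range(kmax + 1):
        y = (k - mu) ** 2 - k
        mean += pmf * y
        second += pmf * y * y
        pmf *= lam / (k + 1)  # advance to Pr[N = k + 1]
    return mean, second - mean ** 2
```

The truncation at `kmax` is harmless for moderate $\lambda$ because the Poisson tail decays faster than exponentially.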

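Lemma 2 demonstrates the separation in the means of the statistic Z in the two cases of interest,
i.e., $p \in C$ versus $d_{\mathrm{TV}}(p, C) \ge \varepsilon$, and Lemma 3 shows the separation in the variances in the two
cases. These two results are proved in Section B.


Lemma 2. If $p \in C$, then $\mathbb{E}[Z] \le \frac{1}{500} m\varepsilon^2$. If $d_{\mathrm{TV}}(p, C) \ge \varepsilon$, then $\mathbb{E}[Z] \ge \frac{1}{5} m\varepsilon^2$.

Lemma 3. If $p \in C$, then $\mathrm{Var}[Z] \le \frac{1}{500000} m^2\varepsilon^4$. If $d_{\mathrm{TV}}(p, C) \ge \varepsilon$, then $\mathrm{Var}[Z] \le \frac{1}{100} \mathbb{E}[Z]^2$.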

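Assuming Lemmas 2 and 3, Theorem 2 is now a simple application of Chebyshev’s inequality.
When $p \in C$, we have that

$$\mathbb{E}[Z] + \sqrt{3}\,\mathrm{Var}[Z]^{1/2} \le \left(\frac{1}{500} + \sqrt{3}\left(\frac{1}{500000}\right)^{1/2}\right) m\varepsilon^2 \le \frac{1}{200}\, m\varepsilon^2.$$

Thus, Chebyshev’s inequality gives

$$\Pr\left[Z \ge m\varepsilon^2/10\right] \le \Pr\left[Z \ge m\varepsilon^2/200\right] \le \Pr\left[Z - \mathbb{E}[Z] \ge \sqrt{3}\,\mathrm{Var}[Z]^{1/2}\right] \le \frac{1}{3}.$$

The case for $d_{\mathrm{TV}}(p, C) \ge \varepsilon$ is similar. Here,

$$\mathbb{E}[Z] - \sqrt{3}\,\mathrm{Var}[Z]^{1/2} \ge \left(1 - \sqrt{3}\left(\frac{1}{100}\right)^{1/2}\right) \mathbb{E}[Z] \ge 3m\varepsilon^2/20.$$

Therefore,

$$\Pr\left[Z \le m\varepsilon^2/10\right] \le \Pr\left[Z \le 3m\varepsilon^2/20\right] \le \Pr\left[Z - \mathbb{E}[Z] \le -\sqrt{3}\,\mathrm{Var}[Z]^{1/2}\right] \le \frac{1}{3}.$$

□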
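The numerical constants in the Chebyshev argument are easy to check mechanically. The sketch below (function names ours) verifies the accept-case inequality with the constants 1/500, 1/500000 and 1/200, and the reject-case inequality under the assumption that the mean is at least $m\varepsilon^2/5$ and the variance at most $\mathbb{E}[Z]^2/100$, which is how we read Lemmas 2 and 3.

```python
import math

def accept_case_bound():
    """Coefficient of m * eps^2 in E[Z] + sqrt(3) * sd when p is in C."""
    return 1 / 500 + math.sqrt(3) * math.sqrt(1 / 500000)

def reject_case_bound():
    """Coefficient of m * eps^2 in E[Z] - sqrt(3) * sd when p is far from C.

    ASSUMPTION: E[Z] >= m eps^2 / 5 and Var[Z] <= E[Z]^2 / 100.
    """
    return (1 - math.sqrt(3) * math.sqrt(1 / 100)) * (1 / 5)

def chebyshev_tail(k):
    """Chebyshev: Pr[|Z - E[Z]| >= k * sd] <= 1 / k^2."""
    return 1 / k ** 2
```

Both coefficients sit on the correct side of the acceptance threshold 1/10, and taking $k = \sqrt{3}$ in Chebyshev's inequality yields the failure probability of 1/3.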


5 Testing Monotonicity 


As an application of our testing framework, we will demonstrate how to test for monotonicity. Let 
$d \ge 1$, and $i = (i_1, \ldots, i_d)$, $j = (j_1, \ldots, j_d) \in [n]^d$. We say $i \succeq j$ if $i_l \ge j_l$ for $l = 1, \ldots, d$.

Definition 7. A distribution p over $[n]^d$ is monotone (decreasing) if for all $i \succeq j$, $p_i \le p_j$.

Our main result of this section is as follows: 


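Theorem 3. For any $d \ge 1$, there exists an algorithm for testing monotonicity over $[n]^d$ with
sample complexity

$$O\left(\frac{n^{d/2}}{\varepsilon^2} + \left(\frac{d \log n}{\varepsilon^2}\right)^{d} \cdot \frac{1}{\varepsilon^2}\right)$$

and time complexity $O\left(\frac{n^{d/2}}{\varepsilon^2} + \mathrm{poly}(\log n, 1/\varepsilon)^d\right)$.


In particular, this implies the following optimal algorithms for monotonicity testing for all $d \ge 1$:


Corollary 1. Fix any $d \ge 1$, and suppose $\varepsilon \ge \sqrt{d \log n}/n^{1/4}$. Then there exists an algorithm for testing
monotonicity over $[n]^d$ with sample complexity $O(n^{d/2}/\varepsilon^2)$.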

Our analysis starts with a structural lemma about monotone distributions. In [Bir87], Birgé
showed that any monotone distribution p over [n] can be obliviously decomposed into $O(\log(n)/\varepsilon)$
intervals, such that the flattening $\bar{p}$ (recall Definition 6) of p over these intervals is $\varepsilon$-close to p in
total variation distance. [AJOS14] extend this result, giving a bound between the $\chi^2$-distance of
p and $\bar{p}$. We strengthen these results by extending them to monotone distributions over $[n]^d$. In
particular, we partition the domain $[n]^d$ of p into $O((d \log(n)/\varepsilon^2)^d)$ rectangles, and compare it with
$\bar{p}$, the flattening over these rectangles.


Lemma 4. Let $d \ge 1$. There is an oblivious decomposition of $[n]^d$ into $O((d \log(n)/\varepsilon^2)^d)$ rectangles
such that for any monotone distribution p over $[n]^d$, its flattening $\bar{p}$ over these rectangles satisfies

$$\chi^2(p, \bar{p}) \le \varepsilon^2.$$
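For intuition in the case d = 1, the classical Birgé-style decomposition uses intervals whose lengths grow geometrically, roughly as $(1+\varepsilon)^j$, so that $O(\log(n)/\varepsilon)$ intervals cover [n] without looking at any samples. The sketch below implements that standard one-dimensional construction; it is an assumption that this matches the exact parameters used for the $\chi^2$ and d-dimensional versions in Lemma 4, and the function name is ours.

```python
import math

def birge_intervals(n, eps):
    """Oblivious partition of range(n) into half-open intervals of length ~ (1 + eps)^j."""
    intervals, start, j = [], 0, 0
    while start < n:
        length = max(1, math.floor((1 + eps) ** j))
        end = min(n, start + length)  # the last interval is truncated at n
        intervals.append((start, end))
        start = end
        j += 1
    return intervals
```

Because the partition depends only on n and $\varepsilon$, the same intervals can be reused for every monotone distribution, which is what "oblivious" refers to.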


This effectively reduces the support size to logarithmic in n. At this point, we can apply the 
Laplace estimator (along the lines of [KOPS15]) and learn a q such that if p was monotone, then q
will be $O(\varepsilon^2)$-close in $\chi^2$-distance.

Lemma 5. Let $d \ge 1$, and p be a monotone distribution over $[n]^d$. There is an algorithm which
outputs a distribution q such that $\mathbb{E}[\chi^2(p, q)] \le O(\varepsilon^2)$. The time and sample complexity are both
$O((d \log(n)/\varepsilon^2)^d/\varepsilon^2)$.
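A minimal sketch of the learning step for d = 1: estimate the mass of each interval of the decomposition with the add-one (Laplace) rule and spread it uniformly inside the interval. The add-one form $(N_I + 1)/(m + \ell)$ is the textbook Laplace estimator; whether the paper's estimator uses exactly these constants is an assumption, and the function name and interval representation are ours.

```python
def laplace_flattened_estimate(counts, intervals):
    """Add-one estimate of each interval's mass, spread uniformly within the interval.

    counts[i]: number of samples equal to element i
    intervals: list of half-open (start, end) index pairs partitioning range(len(counts))
    """
    m = sum(counts)
    ell = len(intervals)
    q = [0.0] * len(counts)
    for start, end in intervals:
        mass = (sum(counts[start:end]) + 1) / (m + ell)
        for i in range(start, end):
            q[i] = mass / (end - start)
    return q
```

The add-one smoothing keeps every $q_i$ strictly positive, which matters because the $\chi^2$-distance divides by $q_i$.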

The final step before we apply our $\chi^2$-tester is to compute the distance between q and $\mathcal{M}_n^d$.
This subroutine is similar to the one introduced by [BKR04]. The key idea is to write a linear
program, which searches for any distribution f which is close to q in total variation distance. We
note that the desired properties of f (i.e., monotonicity, normalization, and $\varepsilon$-closeness to q) are
easy to enforce as linear constraints. If we find that such an f exists, we will apply our $\chi^2$-test to
q. If not, we output Reject, as this is sufficient evidence to conclude that $p \notin \mathcal{M}_n^d$. Note that
the linear program operates over the oblivious decomposition used in our structural result, so the
complexity is polynomial in $(d \log(n)/\varepsilon^2)^d$, rather than the naive $n^d$.
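For concreteness, one way to write such a feasibility program in the case d = 1 is sketched below. The notation is ours: $I_1, \ldots, I_\ell$ are the intervals of the decomposition, $q_j$ is the (constant) value of q on $I_j$, $f_j$ is the value of the candidate f on $I_j$, and $t_j$ is an auxiliary variable bounding $|f_j - q_j|$. Restricting f to be constant on each interval loses nothing here, since flattening a monotone f over the intervals keeps it monotone and cannot increase its $\ell_1$ distance to the piecewise-constant q.

```latex
\begin{align*}
\text{find } \quad & f_1, \dots, f_\ell,\ t_1, \dots, t_\ell \\
\text{subject to } \quad
  & f_j \ge f_{j+1} \ge 0, && j = 1, \dots, \ell - 1 && \text{(monotonicity)} \\
  & \sum_{j=1}^{\ell} |I_j| \, f_j = 1 && && \text{(normalization)} \\
  & -t_j \le f_j - q_j \le t_j, && j = 1, \dots, \ell && \text{(absolute values)} \\
  & \frac{1}{2} \sum_{j=1}^{\ell} |I_j| \, t_j \le \varepsilon && && \text{(closeness to } q\text{)}
\end{align*}
```

The program has $2\ell$ variables and $O(\ell)$ constraints, so its size is governed by the number of intervals rather than by n.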

At this point, we have precisely the guarantees needed to apply Theorem 2, directly implying
Theorem 3. Proofs of the lemmas in this section are provided in Section C. We note that the
class of monotone distributions is the simplest of the classes we consider. We now consider testing
for log-concavity, monotone hazard rate, and unimodality, all of which are much more challenging
to test. In particular, these classes require a more sophisticated structural understanding, more
complex proper $\chi^2$-learning algorithms, and non-trivial modifications to our $\chi^2$-tester. We have
already given some details on the required adaptations to the tester in Remark 1.

Our algorithms for learning these classes use convex programming. One of the main challenges is
to enforce log-concavity of the PDF when learning $\mathcal{LCD}_n$ (respectively, of the CDF when learning
$\mathcal{MHR}_n$) while simultaneously enforcing closeness in total variation distance. This involves a
careful choice of our variables, and we exploit structural properties of the classes to ensure the
soundness of particular Taylor approximations. We encourage the reader to refer to the proofs of
the corresponding theorems for more details.

6 Testing Unimodality 

One striking feature of Birgé's result is that the decomposition of the domain is oblivious to the
samples, and therefore to the unknown distribution. However, such an oblivious decomposition
will not work for unimodal distributions, since the mode is unknown. If we knew where
the mode of the unknown distribution was, then the problem could be decomposed into monotone
functions over two intervals. Therefore, in theory, one can modify the monotonicity testing
algorithm by iterating over all the possible n modes. Indeed, by applying a union bound, it then
follows that

Theorem 4. (Follows from Monotone) For $\varepsilon >$ there exists an algorithm for testing unimodality
over [n] with sample complexity $O\left(\frac{\sqrt{n}}{\varepsilon^2}\log n\right)$.

However, this is unsatisfactory, since our lower bound (and as we will demonstrate, the true
complexity of this problem) is $\sqrt{n}/\varepsilon^2$. We overcome the logarithmic barrier introduced by the
union bound by employing a non-oblivious decomposition of the domain and using Kolmogorov's
max-inequality.

Our main result for testing unimodality is the following theorem, which is proved in Section D.

Theorem 5. Suppose e > . Then there exists an algorithm for testing unimodality over [n] 

with sample complexity $O(\sqrt{n}/\varepsilon^2)$.

7 Testing Independence of Random Variables 

Let $\mathcal{X} \stackrel{\mathrm{def}}{=} [n_1] \times \cdots \times [n_d]$, and let $\Pi_d$ be the class of all product distributions over $\mathcal{X}$. We first
bound the $\chi^2$-distance between product distributions in terms of the individual coordinates.
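
Lemma 6. Let $p = p^1 \times \cdots \times p^d$ and $q = q^1 \times \cdots \times q^d$ be two distributions in $\Pi_d$. Then

$$\chi^2(p,q) = \prod_{\ell=1}^{d} \left(1 + \chi^2(p^\ell, q^\ell)\right) - 1.$$

Proof. By the definition of $\chi^2$-distance,

\begin{align}
\chi^2(p,q) &= \sum_{i \in \mathcal{X}} \frac{p_i^2}{q_i} - 1 \tag{3} \\
&= \prod_{\ell=1}^{d} \left( \sum_{i \in [n_\ell]} \frac{(p_i^\ell)^2}{q_i^\ell} \right) - 1 \tag{4} \\
&= \prod_{\ell=1}^{d} \left(1 + \chi^2(p^\ell, q^\ell)\right) - 1. \tag{5}
\end{align}

□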
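The product identity of Lemma 6 can be checked numerically on a small example. The sketch below is only an illustration: the helper names `chi2` and `product_pmf` and the marginals are hypothetical choices made here, not part of the paper.

```python
from itertools import product
from math import prod

def chi2(p, q):
    """Chi-squared distance sum_i (p_i - q_i)^2 / q_i between two pmfs given as lists."""
    return sum((pi - qi) ** 2 / qi for pi, qi in zip(p, q))

def product_pmf(marginals):
    """Joint pmf of independent coordinates, flattened in itertools.product order."""
    return [prod(vals) for vals in product(*marginals)]

# Hypothetical marginals, chosen only for illustration.
p_marg = [[0.5, 0.5], [0.2, 0.3, 0.5]]
q_marg = [[0.4, 0.6], [0.25, 0.25, 0.5]]

# Left-hand side: chi-squared distance between the two joint distributions.
lhs = chi2(product_pmf(p_marg), product_pmf(q_marg))
# Right-hand side: product over coordinates of (1 + chi2 of the marginals), minus 1.
rhs = prod(1 + chi2(pl, ql) for pl, ql in zip(p_marg, q_marg)) - 1
```

Both quantities agree up to floating-point error, as the lemma predicts.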

Along the lines of learning monotone distributions in $\chi^2$ distance we obtain the following result,
proved in Section E.

Lemma 7. There is an algorithm that takes

$$O\left(\sum_{\ell=1}^{d} \frac{n_\ell}{\varepsilon^2}\right)$$

samples from a distribution p in $\Pi_d$ and outputs a distribution $q \in \Pi_d$ such that with probability at
least 5/6,

$$\chi^2(p,q) \le O(\varepsilon^2).$$

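This fits precisely in our framework of robust testing. In particular, applying Theorem 2,
we obtain the following result.

Theorem 6. For any $d \ge 1$, there exists an algorithm for testing independence of random variables
over $[n_1] \times \cdots \times [n_d]$ with sample and time complexity

$$O\left(\frac{\left(\prod_{\ell=1}^{d} n_\ell\right)^{1/2} + \sum_{\ell=1}^{d} n_\ell}{\varepsilon^2}\right).$$

The following corollaries are immediate.

Corollary 2. Suppose $\prod_{\ell=1}^{d} n_\ell^{1/2} \ge \sum_{\ell=1}^{d} n_\ell$. Then there exists an algorithm for testing independence
over $[n_1] \times \cdots \times [n_d]$ with sample complexity $O\left(\left(\prod_{\ell=1}^{d} n_\ell\right)^{1/2}/\varepsilon^2\right)$.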

In particular, 

Corollary 3. There exists an algorithm for testing if two distributions over [n] are independent 
with sample complexity $O(n/\varepsilon^2)$.

8 Testing Log-Concavity


In this section we describe our results for testing log-concavity of distributions. Our main result is 
as follows: 


Theorem 7. There exists an algorithm for testing log-concavity over [n] with sample complexity

$$O\left(\frac{\sqrt{n}}{\varepsilon^2} + \frac{1}{\varepsilon^5}\right)$$

and time complexity $\mathrm{poly}(n, 1/\varepsilon)$.

In particular, this implies the following optimal tester for this class: 

Corollary 4. Suppose £ > . Then there exists an algorithm for testing log-concavity over 

[n] with sample complexity $O(\sqrt{n}/\varepsilon^2)$.

Our algorithm will fit into the structure of our general framework. We first perform a very 
particular type of learning algorithm, whose guarantees are summarized in the following lemma: 

Lemma 8. Given $\varepsilon > 0$ and sample access to a distribution p, there exists an algorithm with the
following guarantees:

• If $p \in \mathcal{LCD}_n$, the algorithm outputs a distribution $q \in \mathcal{LCD}_n$ and an $O(\varepsilon)$-effective support S
of p such that $\chi^2(p_S, q_S) \le \frac{\varepsilon^2}{500}$ with probability at least 5/6;

• If $d_{\mathrm{TV}}(p, \mathcal{LCD}_n) \ge \varepsilon$, the algorithm either outputs a distribution $q \in \mathcal{LCD}_n$ or Reject.

The sample complexity is $O(1/\varepsilon^5)$ and the time complexity is $\mathrm{poly}(n, 1/\varepsilon)$.

We note that as a corollary, one immediately obtains an $O(1/\varepsilon^5)$ proper learning algorithm for
log-concave distributions. The result is immediate from the first item of Lemma 8 and Proposition 1.
We can actually do a bit better: in the proof of Lemma 8 we partition [n] into intervals of
probability mass $\Theta(\varepsilon^{3/2})$. If one instead partitions into intervals of probability mass $\Theta(\varepsilon/\log(1/\varepsilon))$
and works directly with total variation distance instead of $\chi^2$ distance, one can show that $O(1/\varepsilon^4)$
samples suffice.

Corollary 5. Given $\varepsilon > 0$ and sample access to a distribution $p \in \mathcal{LCD}_n$, there exists an algorithm
which outputs a distribution $q \in \mathcal{LCD}_n$ such that $d_{\mathrm{TV}}(p,q) \le \varepsilon$. The sample complexity is $O(1/\varepsilon^4)$
and the time complexity is $\mathrm{poly}(n, 1/\varepsilon)$.

Then, given the guarantees of Lemma 8, Theorem 7 follows from Theorem 2. The details of
these results are presented in Section F.

9 Testing for Monotone Hazard Rate 

In this section, we obtain our main result for testing for monotone hazard rate: 

®To be more precise, we require the modification of Theorem [7] which is described in Section [T] in order to handle 
the case where the y^-distance guarantees only hold for a known effective support. 

Theorem 8. There exists an algorithm for testing monotone hazard rate over [n] with sample
complexity

$$O\left(\frac{\sqrt{n}}{\varepsilon^2} + \frac{\log(n/\varepsilon)}{\varepsilon^4}\right)$$

and time complexity $\mathrm{poly}(n, 1/\varepsilon)$.


This implies the following optimal tester for the class: 

Corollary 6. Suppose $\varepsilon > \sqrt{\log(n/\varepsilon)}/n^{1/4}$. Then there exists an algorithm for testing monotone
hazard rate over [n] with sample complexity $O(\sqrt{n}/\varepsilon^2)$.

We obey the same framework as before, first applying a $\chi^2$-learner with the following guarantees:

Lemma 9. Given $\varepsilon > 0$ and sample access to a distribution p, there exists an algorithm with the
following guarantees:

• If $p \in \mathcal{MHR}_n$, the algorithm outputs a distribution $q \in \mathcal{MHR}_n$ and an $O(\varepsilon)$-effective support
S of p such that $\chi^2(p_S, q_S) \le \frac{\varepsilon^2}{500}$ with probability at least 5/6;

• If $d_{\mathrm{TV}}(p, \mathcal{MHR}_n) \ge \varepsilon$, the algorithm either outputs a distribution $q \in \mathcal{MHR}_n$ and a set
$S \subseteq [n]$ or Reject.

The sample complexity is $O(\log(n/\varepsilon)/\varepsilon^4)$ and the time complexity is $\mathrm{poly}(n, 1/\varepsilon)$.

As with log-concave distributions, this implies the following proper learning result: 


Corollary 7. Given $\varepsilon > 0$ and sample access to a distribution $p \in \mathcal{MHR}_n$, there exists an
algorithm which outputs a distribution $q \in \mathcal{MHR}_n$ such that $d_{\mathrm{TV}}(p,q) \le \varepsilon$. The sample complexity
is 0(log(n/e)/e^) and the time complexity is poly(n, 1/e). 

Again, combining the learning guarantees of Lemma 9 with the appropriate variant of Theorem 2,
we obtain Theorem 8. The details of the argument and proofs are presented in Section G.


10 Lower Bounds 


We now prove sharp lower bounds for the classes of distributions we consider. We show that the
example studied by Paninski [Pan08] to prove lower bounds on testing uniformity can be used
to prove lower bounds for the classes we consider. They consider a class $\mathcal{Q}$ consisting of $2^{n/2}$
distributions defined as follows. Without loss of generality assume that n is even. For each of the
$2^{n/2}$ vectors $z_0 z_1 \ldots z_{n/2-1} \in \{-1, 1\}^{n/2}$, define a distribution $q \in \mathcal{Q}$ over [n] as follows.

$$q_i = \begin{cases} \dfrac{1 + z_\ell c\varepsilon}{n} & \text{for } i = 2\ell + 1, \\[2mm] \dfrac{1 - z_\ell c\varepsilon}{n} & \text{for } i = 2\ell. \end{cases} \tag{6}$$


Each distribution in $\mathcal{Q}$ has a total variation distance $c\varepsilon/2$ from $U_n$, the uniform distribution
over [n]. By choosing c to be an appropriate constant, Paninski [Pan08] showed that a distribution
picked uniformly at random from $\mathcal{Q}$ cannot be distinguished from $U_n$ with probability at least 2/3
using fewer than order $\sqrt{n}/\varepsilon^2$ samples.
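To make the construction concrete, the following sketch builds one member of $\mathcal{Q}$ from a sign vector and checks its distance to uniform. It assumes the reading of (6) in which paired elements receive mass $(1 \pm z_\ell c\varepsilon)/n$; the function names and the 0-based indexing are illustrative choices made here, not the paper's.

```python
def paninski_distribution(z, n, c, eps):
    """Build q from signs z (length n/2, entries +1 or -1).

    Element 2l gets mass (1 - z_l*c*eps)/n and element 2l+1 gets (1 + z_l*c*eps)/n,
    using 0-based positions in the returned list.
    """
    assert n % 2 == 0 and len(z) == n // 2
    q = []
    for zl in z:
        q.append((1 - zl * c * eps) / n)
        q.append((1 + zl * c * eps) / n)
    return q

def total_variation(p, q):
    """Total variation distance, i.e. half the L1 distance."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

# Hypothetical parameters, chosen only for illustration.
n, c, eps = 8, 1.0, 0.2
q = paninski_distribution([1, -1, 1, 1], n, c, eps)
uniform = [1.0 / n] * n
tv = total_variation(q, uniform)  # equals c*eps/2 regardless of the sign vector
```

Every element deviates from 1/n by exactly $c\varepsilon/n$, which is why the distance to uniform does not depend on z.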

Suppose $\mathcal{C}$ is a class of distributions such that

• The uniform distribution $U_n$ is in $\mathcal{C}$,

• For appropriately chosen c, $d_{\mathrm{TV}}(\mathcal{C}, \mathcal{Q}) \ge \varepsilon$,

then testing $\mathcal{C}$ is not easier than distinguishing $U_n$ from $\mathcal{Q}$. Invoking [Pan08] immediately implies
that testing the class $\mathcal{C}$ requires $\Omega(\sqrt{n}/\varepsilon^2)$ samples.

The lower bounds for all the one dimensional distributions will follow directly from this construction,
and for testing monotonicity in higher dimensions, we extend this construction to $d \ge 1$
appropriately. These arguments are proved in Section H, leading to the following lower bounds for
testing these classes:

Theorem 9. 

• For any d > 1, any algorithm for testing monotonicity over [n]'^ requires samples. 

• For d> 1, any algorithm for testing independence over [m] x • • • x [re^] requires ^ 

samples. 

• Any algorithm for testing unimodality, log-concavity, or monotone hazard rate over [n] requires
$\Omega(\sqrt{n}/\varepsilon^2)$ samples.

A Moments of the Chi-Squared Statistic

We analyze the mean and variance of the statistic

$$Z = \sum_{i \in A} \frac{(X_i - m q_i)^2 - X_i}{m q_i},$$

where each $X_i$ is independently distributed according to $\mathrm{Poisson}(m p_i)$.
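We start with the mean:

\begin{align*}
\mathbb{E}[Z] &= \sum_{i \in A} \mathbb{E}\left[\frac{(X_i - m q_i)^2 - X_i}{m q_i}\right] \\
&= \sum_{i \in A} \frac{\mathbb{E}[X_i^2] - 2 m q_i \mathbb{E}[X_i] + m^2 q_i^2 - \mathbb{E}[X_i]}{m q_i} \\
&= \sum_{i \in A} \frac{m^2 p_i^2 + m p_i - 2 m^2 q_i p_i + m^2 q_i^2 - m p_i}{m q_i} \\
&= m \sum_{i \in A} \frac{(p_i - q_i)^2}{q_i} \\
&= m \cdot \chi^2(p_A, q_A).
\end{align*}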
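The mean computation can be replayed mechanically. The sketch below evaluates the statistic on given counts and compares the term-by-term expectation (using the Poisson second moment $\mathbb{E}[X^2] = \lambda + \lambda^2$) against the closed form $m \cdot \chi^2(p, q)$. For simplicity it assumes A is the whole domain; the function names and the example vectors are hypothetical.

```python
def chi2_statistic(counts, q, m):
    """Z = sum_i ((X_i - m*q_i)^2 - X_i) / (m*q_i), taking A to be the whole domain."""
    return sum(((x - m * qi) ** 2 - x) / (m * qi) for x, qi in zip(counts, q))

def expected_statistic_from_moments(p, q, m):
    """E[Z] computed term by term from E[X_i] = m*p_i and E[X_i^2] = lam + lam^2."""
    total = 0.0
    for pi, qi in zip(p, q):
        lam = m * pi
        second_moment = lam + lam ** 2
        total += (second_moment - 2 * m * qi * lam + (m * qi) ** 2 - lam) / (m * qi)
    return total

def expected_statistic_closed_form(p, q, m):
    """Closed form m * sum_i (p_i - q_i)^2 / q_i."""
    return m * sum((pi - qi) ** 2 / qi for pi, qi in zip(p, q))

# Hypothetical example values.
p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]
m = 100
mean_moments = expected_statistic_from_moments(p, q, m)
mean_closed = expected_statistic_closed_form(p, q, m)
```

Subtracting $X_i$ in the numerator is what removes the $m p_i$ term contributed by the Poisson variance, leaving a quantity whose mean is exactly proportional to the $\chi^2$-distance.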

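Next, we analyze the variance. Let $\lambda_i = \mathbb{E}[X_i] = m p_i$ and $\lambda'_i = m q_i$.

\begin{align}
\mathrm{Var}[Z] &= \sum_{i \in A} \frac{1}{\lambda_i'^2} \mathrm{Var}\left[(X_i - \lambda_i)^2 + 2(X_i - \lambda_i)(\lambda_i - \lambda'_i) - (X_i - \lambda_i)\right] \nonumber \\
&= \sum_{i \in A} \frac{1}{\lambda_i'^2} \mathrm{Var}\left[(X_i - \lambda_i)^2 + (X_i - \lambda_i)(2\lambda_i - 2\lambda'_i - 1)\right] \nonumber \\
&= \sum_{i \in A} \frac{1}{\lambda_i'^2} \mathbb{E}\left[(X_i - \lambda_i)^4 + 2(X_i - \lambda_i)^3(2\lambda_i - 2\lambda'_i - 1) + (X_i - \lambda_i)^2(2\lambda_i - 2\lambda'_i - 1)^2 - \lambda_i^2\right] \nonumber \\
&= \sum_{i \in A} \frac{1}{\lambda_i'^2} \left[3\lambda_i^2 + \lambda_i + 2\lambda_i(2\lambda_i - 2\lambda'_i - 1) + \lambda_i(2\lambda_i - 2\lambda'_i - 1)^2 - \lambda_i^2\right] \nonumber \\
&= \sum_{i \in A} \frac{1}{\lambda_i'^2} \left[2\lambda_i^2 + \lambda_i + 4\lambda_i(\lambda_i - \lambda'_i) - 2\lambda_i + \lambda_i\left(4(\lambda_i - \lambda'_i)^2 - 4(\lambda_i - \lambda'_i) + 1\right)\right] \nonumber \\
&= \sum_{i \in A} \frac{1}{\lambda_i'^2} \left[2\lambda_i^2 + 4\lambda_i(\lambda_i - \lambda'_i)^2\right] \nonumber \\
&= \sum_{i \in A} \left[2\frac{p_i^2}{q_i^2} + 4m \frac{p_i \cdot (p_i - q_i)^2}{q_i^2}\right]. \tag{7}
\end{align}

The third equality is by noting the random variable has expectation $\lambda_i$ and the fourth equality
substitutes the values of centralized moments of the Poisson distribution.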
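As a sanity check on (7), the per-element variance term can be recomputed directly from the Poisson central moments $\mu_2 = \lambda$, $\mu_3 = \lambda$, $\mu_4 = 3\lambda^2 + \lambda$ and compared with the closed form. This is a minimal sketch with hypothetical function names and example values.

```python
def var_term_closed_form(p_i, q_i, m):
    """Per-element summand of (7): 2 p^2/q^2 + 4 m p (p - q)^2 / q^2."""
    return 2 * p_i ** 2 / q_i ** 2 + 4 * m * p_i * (p_i - q_i) ** 2 / q_i ** 2

def var_term_from_central_moments(p_i, q_i, m):
    """Same summand, computed as Var[(X-lam)^2 + c (X-lam)] / lam_q^2 for X ~ Poisson(lam)."""
    lam, lam_q = m * p_i, m * q_i
    c = 2 * lam - 2 * lam_q - 1
    mu2, mu3, mu4 = lam, lam, 3 * lam ** 2 + lam  # Poisson central moments
    second_moment = mu4 + 2 * c * mu3 + c ** 2 * mu2
    # The variable (X-lam)^2 + c (X-lam) has mean lam, so subtract lam^2.
    return (second_moment - lam ** 2) / lam_q ** 2

# Hypothetical example values.
closed = var_term_closed_form(0.3, 0.2, 50)
moments = var_term_from_central_moments(0.3, 0.2, 50)
```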

B Analysis of our $\chi^2$-Test Statistic

We first prove the key lemmas in the analysis of our $\chi^2$-test.

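Proof of Lemma 2. The former case is straightforward from (1) and Property 2 of q.

We turn to the latter case. Recall that $A = \{i : q_i \ge \varepsilon/50n\}$, and thus $q(\bar{A}) \le \varepsilon/50$. We first
show that $d_{\mathrm{TV}}(p_A, q_A) \ge \frac{6\varepsilon}{25}$, where $p_A, q_A$ are defined as above and, in a slight abuse of notation,
we use $d_{\mathrm{TV}}(p_A, q_A)$ for non-probability vectors to denote $\frac{1}{2}\|p_A - q_A\|_1$.
Partitioning the support into $A$ and $\bar{A}$, we have

$$d_{\mathrm{TV}}(p, q) = d_{\mathrm{TV}}(p_A, q_A) + d_{\mathrm{TV}}(p_{\bar{A}}, q_{\bar{A}}). \tag{8}$$

We consider the following cases separately:

• $p(\bar{A}) \le \varepsilon/2$: In this case,

$$d_{\mathrm{TV}}(p_{\bar{A}}, q_{\bar{A}}) = \frac{1}{2} \sum_{i \in \bar{A}} |p_i - q_i| \le \frac{1}{2}\left(p(\bar{A}) + q(\bar{A})\right) \le \frac{1}{2}\left(\frac{\varepsilon}{2} + \frac{\varepsilon}{50}\right) = \frac{13\varepsilon}{50}.$$

Plugging this in (8), and using the fact that $d_{\mathrm{TV}}(p,q) \ge \varepsilon$, shows that $d_{\mathrm{TV}}(p_A, q_A) \ge \frac{37\varepsilon}{50} \ge \frac{6\varepsilon}{25}$.

• $p(\bar{A}) > \varepsilon/2$: In this case, by the reverse triangle inequality,

$$d_{\mathrm{TV}}(p_A, q_A) \ge \frac{1}{2}\left(q(A) - p(A)\right) \ge \frac{1}{2}\left((1 - \varepsilon/50) - (1 - \varepsilon/2)\right) = \frac{6\varepsilon}{25}.$$

By the Cauchy-Schwarz inequality,

$$\chi^2(p_A, q_A) \ge 4\,\frac{d_{\mathrm{TV}}(p_A, q_A)^2}{q(A)} \ge \frac{\varepsilon^2}{5}.$$

We conclude by recalling (1). □

Proof of Lemma 3. We bound the terms of (2) separately, starting with the first.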


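\begin{align}
\sum_{i \in A} 2\frac{p_i^2}{q_i^2} &= 2\sum_{i \in A} \left(\frac{(p_i - q_i)^2}{q_i^2} + \frac{2 p_i q_i - q_i^2}{q_i^2}\right) \nonumber \\
&\le 2n + 2\sum_{i \in A} \left(\frac{(p_i - q_i)^2}{q_i^2} + 2\,\frac{p_i - q_i}{q_i}\right) \nonumber \\
&\le 4n + 4\sum_{i \in A} \frac{(p_i - q_i)^2}{q_i^2} \nonumber \\
&\le 4n + \frac{200n}{\varepsilon} \sum_{i \in A} \frac{(p_i - q_i)^2}{q_i} \nonumber \\
&= 4n + \frac{200n}{\varepsilon} \cdot \frac{\mathbb{E}[Z]}{m} \nonumber \\
&\le 4n + \frac{1}{100}\sqrt{n}\,\mathbb{E}[Z]. \tag{9}
\end{align}

The second inequality is the AM-GM inequality, the third inequality uses that $q_i \ge \frac{\varepsilon}{50n}$ for all
$i \in A$, the last equality uses (1), and the final inequality substitutes a value $m \ge 20000\frac{\sqrt{n}}{\varepsilon^2}$.

The second term can be similarly bounded:

\begin{align*}
\sum_{i \in A} 4m\,\frac{p_i (p_i - q_i)^2}{q_i^2} &\le 4m \left(\sum_{i \in A} \frac{p_i^2}{q_i^2}\right)^{1/2} \left(\sum_{i \in A} \frac{(p_i - q_i)^4}{q_i^2}\right)^{1/2} \\
&\le 4m \left(4n + \frac{1}{100}\sqrt{n}\,\mathbb{E}[Z]\right)^{1/2} \left(\sum_{i \in A} \frac{(p_i - q_i)^4}{q_i^2}\right)^{1/2} \\
&\le 4m \left(2\sqrt{n} + \frac{1}{10} n^{1/4}\,\mathbb{E}[Z]^{1/2}\right) \left(\sum_{i \in A} \frac{(p_i - q_i)^2}{q_i}\right) \\
&= \left(8\sqrt{n} + \frac{2}{5} n^{1/4}\,\mathbb{E}[Z]^{1/2}\right) \mathbb{E}[Z].
\end{align*}

The first inequality is Cauchy-Schwarz, the second inequality uses (9), the third inequality uses the
monotonicity of the $\ell_p$ norms, and the equality uses (1).

Combining the two terms, we get

$$\mathrm{Var}[Z] \le 4n + 9\sqrt{n}\,\mathbb{E}[Z] + \frac{2}{5} n^{1/4}\,\mathbb{E}[Z]^{3/2}.$$

We now consider the two cases in the statement of our lemma.

When $p \in \mathcal{C}$, we know from Lemma 2 that $\mathbb{E}[Z] \le \frac{1}{500} m\varepsilon^2$. Combined with a choice of
$m \ge 20000\frac{\sqrt{n}}{\varepsilon^2}$ and the above expression for the variance, this gives:

$$\mathrm{Var}[Z] \le \frac{4}{20000^2}\, m^2\varepsilon^4 + \frac{9}{20000 \cdot 500}\, m^2\varepsilon^4 + \frac{\sqrt{10}}{12500000}\, m^2\varepsilon^4 \le \frac{1}{500000}\, m^2\varepsilon^4.$$

When $d_{\mathrm{TV}}(p, \mathcal{C}) \ge \varepsilon$, Lemma 2 and $m \ge 20000\frac{\sqrt{n}}{\varepsilon^2}$ give:

$$\mathbb{E}[Z] \ge \frac{1}{5} m\varepsilon^2 \ge 4000\sqrt{n}.$$

Combining this with our expression for variance we get:

$$\mathrm{Var}[Z] \le \frac{4}{4000^2}\,\mathbb{E}[Z]^2 + \frac{9}{4000}\,\mathbb{E}[Z]^2 + \frac{2}{5\sqrt{4000}}\,\mathbb{E}[Z]^2 \le \frac{1}{100}\,\mathbb{E}[Z]^2.$$

□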


C Details on Testing Monotonicity 

In this section, we prove the lemmas necessary for our monotonicity testing result. 

C.1 A Structural Result for Monotone Distributions on the Hypergrid

Birgé [Bir87] showed that any monotone distribution can be estimated to total variation distance $\varepsilon$ by an
$O(\log(n)/\varepsilon)$-piecewise constant distribution. Moreover, the intervals over which the output is constant
are independent of the distribution p. This result was strengthened to the Kullback-Leibler
divergence by [AJOS14] to study the compression of monotone distributions. They upper bound
the KL divergence by the $\chi^2$ distance and then bound the $\chi^2$ distance. We extend this result to $[n]^d$.
We divide $[n]^d$ into $b^d$ rectangles as follows. Let $\{I_1, \ldots, I_b\}$ be a partition of [n] into consecutive
intervals defined as:

Ij.i _ for 1 < i < |, 

^ y [2(1 -I- for ^ < j <b. 

For j = (ji,... Ja) G [bf, let Ij x Ix ... x Ij^. 

The distance between p and p can be bounded as 


x^ip,p) = 


EE 

je[fe]'^ ie/j 


Pi 


< 


E 


- 1 


- 1 


For j = (ji,... Jj) E 5iarge, let j* = (Ji,... ,ib) be 

ji if ji <6/2 + 1 
ji — 1 otherwise. 

We bound the expression above as follows. 

Let T C [d] be any subset of d. Suppose the size of T is i. Let T be the set of all j that satisfy 
ji = bj2 + 1 for i ^T. In other words, over the dimensions determined by T, the value of the index 
is equal to (i/2 + 1. The map j ^ j* restricted to T is one-to-one, and since at most d — i ol the 
coordinates drop, 

Since there are i coordinates that do not change, and each of them have 2(1 + 7 ) coordinates, we 
obtain 

J2Pi < Y.P? 

jer jeT 

= E^T 

j6T 

Since the mapping is one-to-one, the probability of observing as element in T is the probability 
of observing 5/2 + 1 in I coordinates, which is at most (2/(5+2))^ under any monotone distribution. 
Therefore, 

jer ^ ^ 


/j|-(2(1+7))''(1++-' 

/j.|-2'(l+7y 
For any $\ell$ there are $\binom{d}{\ell}$ choices for $T$. Therefore,


X' 


'ip,p) < 


e=o 


ej \b + 2 


= (1 + 7)" 1 + 


6 + 2 


(l + 7)"-l 


- 1 


I 4 47 

= 7 + 7 ++ 


- 1 


Recall that 7 = 21og(n)/6 > 1/6, implies that the expression above is at most (1 + 27 )'^ — 1. 
This implies Lemma [H 

C.2 Monotone Learning 

Our algorithm requires a distribution q satisfying the properties discussed earlier. We learn a
monotone distribution from samples as follows.

Before proving this result, we prove a general result for learning of arbitrary discrete distributions,
adapting the result from [KOPS15]. For a distribution p, and a partition of the domain
into b intervals $I_1, \ldots, I_b$, let $\bar{p}_i = p(I_j)/|I_j|$ for $i \in I_j$ be the flattening of p over these intervals. We saw
that for monotone distributions there exists a partition of the domain such that $\bar{p}$ is close to the
underlying distribution in $\chi^2$ distance.

Suppose we are given m samples from a distribution p and a partition $I_1, \ldots, I_b$. Let $m_j$ be the
number of samples that fall in $I_j$. For $i \in I_j$, let

$$q_i \stackrel{\mathrm{def}}{=} \frac{1}{|I_j|} \cdot \frac{m_j + 1}{m + b}.$$

Let $S_j = \sum_{i \in I_j} p_i^2$. Then the expected $\chi^2$ distance between p and q can be bounded as follows.
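The estimator defined above (add-one smoothing of the interval counts, spread uniformly within each interval) can be sketched as follows. The function name, the representation of intervals as inclusive `(lo, hi)` pairs, and the sample values are illustrative assumptions, not the paper's notation.

```python
def laplace_flattened_estimate(samples, intervals):
    """Return {element: q_i} with q_i = (m_j + 1) / (|I_j| * (m + b)) for i in I_j.

    `intervals` is a list of inclusive (lo, hi) pairs that partition the domain.
    """
    m, b = len(samples), len(intervals)
    counts = [0] * b
    for x in samples:
        for j, (lo, hi) in enumerate(intervals):
            if lo <= x <= hi:
                counts[j] += 1
                break
    q = {}
    for j, (lo, hi) in enumerate(intervals):
        size = hi - lo + 1
        for i in range(lo, hi + 1):
            q[i] = (counts[j] + 1) / (size * (m + b))
    return q

# Hypothetical data: four samples over the domain {1, ..., 6} split into three intervals.
q_hat = laplace_flattened_estimate([1, 1, 2, 5], [(1, 1), (2, 3), (4, 6)])
```

Because every interval receives a pseudo-count of one, no element is assigned zero mass, which keeps the $\chi^2$-distance to q finite.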


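\begin{align}
\mathbb{E}[\chi^2(p,q)] &= \sum_{j=1}^{b} \sum_{i \in I_j} \sum_{\ell=0}^{m} \binom{m}{\ell} p(I_j)^{\ell} (1 - p(I_j))^{m-\ell}\, \frac{p_i^2}{(\ell+1)/(|I_j|(m+b))} - 1 \nonumber \\
&= \frac{m+b}{m+1} \sum_{j=1}^{b} \frac{S_j |I_j|}{p(I_j)} \sum_{\ell=0}^{m} \binom{m+1}{\ell+1} p(I_j)^{\ell+1} (1 - p(I_j))^{m+1-(\ell+1)} - 1 \nonumber \\
&= \frac{m+b}{m+1} \sum_{j=1}^{b} \frac{S_j |I_j|}{p(I_j)} \left(1 - (1 - p(I_j))^{m+1}\right) - 1 \nonumber \\
&\le \frac{m+b}{m+1} \sum_{j=1}^{b} \frac{S_j |I_j|}{p(I_j)} - 1 \nonumber \\
&= \frac{m+b}{m+1} \left(\chi^2(p,\bar{p}) + 1\right) - 1 \nonumber \\
&= \frac{m+b}{m+1} \cdot \chi^2(p,\bar{p}) + \frac{b-1}{m+1}. \tag{10}
\end{align}

Suppose $\gamma = O(\log(n)/b)$, and $b = O(d \cdot \log(n)/\varepsilon^2)$. Then, by the structural result of Section C.1,

$$\chi^2(p,\bar{p}) \le O(\varepsilon^2). \tag{11}$$

Combining this with (10) gives Lemma 5.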

D Details on testing Unimodality

Recall that to circumvent Birgé's decomposition, we want to decompose the interval into disjoint
intervals such that the probability of each interval is about $O(1/b)$, where b is a parameter specified
later. In particular we consider a decomposition of [n] with the following properties:

1. For each element i with probability at least $1/b$, there is an $I_\ell = \{i\}$.

2. There are at most two intervals with $p(I) \le 1/2b$.

3. Every other interval I satisfies p{I) G [^) |] • 

Let $I_1, \ldots, I_L$ denote the partition of [n] corresponding to these intervals. Note that $L = O(b)$.

Claim 1. There is an algorithm that takes $O(b \log b)$ samples and outputs $I_1, \ldots, I_L$ satisfying the
properties above.


The first step in our algorithm is to estimate the total probability within each of these intervals.
In particular,

Lemma 10. There is an algorithm that takes $m' = O(b \log b/\varepsilon^2)$ samples from a distribution p,
and with probability at least 9/10 outputs a distribution q that is constant on each $I_\ell$. Moreover,
for any j such that $p(I_j) > 1/2b$, $q(I_j) \in (1 \pm \varepsilon) p(I_j)$.

Proof. Consider any interval $I_j$ with $p(I_j) > 1/2b$. The number of samples $N_j$ that fall in that
interval is distributed $\mathrm{Binomial}(m', p(I_j))$. Then by Chernoff bounds for $m' > 12 b \log b/\varepsilon^2$,

\begin{align}
\Pr\left(|N_j - m' p(I_j)| > \varepsilon m' p(I_j)\right) &\le 2\exp\left(-\varepsilon^2 m' p(I_j)/3\right) \tag{12} \\
&\le \frac{2}{b^2}, \tag{13}
\end{align}

where the last inequality uses the fact that $p(I_j) > 1/2b$. □

The next step is to estimate the distance of q from the class of unimodal distributions $\mathcal{UM}_n$. This is possible by a simple dynamic
program, similar to the one used for monotonicity. If the estimated distance is more than $\varepsilon/2$, we
output Reject.

Our next step is to remove certain intervals. This is to ensure that when the underlying
distribution is unimodal, we are able to estimate the distribution multiplicatively over the remaining
intervals. In particular, we do the following preprocessing step:

• $A = \emptyset$.

• For each interval $I_j$,

  - If

    $$q(I_j) \notin \left((1-\varepsilon) \cdot q(I_{j+1}),\ (1+\varepsilon) \cdot q(I_{j+1})\right) \quad \text{OR} \tag{14}$$

    $$q(I_j) \notin \left((1-\varepsilon) \cdot q(I_{j-1}),\ (1+\varepsilon) \cdot q(I_{j-1})\right), \tag{15}$$

    add $I_j$ to A.

• Add the (at most 2) intervals with mass at most $1/2b$ to A.

• Add all intervals j with $q(I_j)/|I_j| < \varepsilon/50n$ to A.
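The preprocessing step above can be sketched in code. This is an illustration under stated assumptions rather than the paper's implementation: the function name is hypothetical, the first and last intervals are compared only against their single existing neighbor, and the "mass at most 1/2b" rule is evaluated on the estimate q because the true masses under p are not available to the algorithm.

```python
def intervals_to_remove(q_mass, sizes, eps, b, n):
    """Return the sorted indices of intervals placed in A.

    q_mass[j] is q(I_j), sizes[j] is |I_j|, b is the partition parameter,
    and n is the domain size.
    """
    A = set()
    L = len(q_mass)
    for j in range(L):
        # Conditions (14)/(15): q(I_j) must lie within (1 +/- eps) of each neighbor.
        for k in (j - 1, j + 1):
            if 0 <= k < L:
                lo, hi = (1 - eps) * q_mass[k], (1 + eps) * q_mass[k]
                if not (lo < q_mass[j] < hi):
                    A.add(j)
        # Light intervals (assumption: tested under q rather than p).
        if q_mass[j] <= 1 / (2 * b):
            A.add(j)
        # Intervals whose per-element mass falls below eps / (50 n).
        if q_mass[j] / sizes[j] < eps / (50 * n):
            A.add(j)
    return sorted(A)

# Hypothetical example: the jump between the last two intervals flags both of them.
removed = intervals_to_remove([0.2, 0.2, 0.2, 0.4], [1, 1, 1, 1], eps=0.1, b=10, n=4)
```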

If the distribution is unimodal, we can prove the following about the set of intervals $A^c$.
Lemma 11. If p is unimodal then, 

• > 1 — e/25 — 1/6 — O (logn/(e6)). 

• Except at most one interval in every other interval Ij satisfies, 

% < (1 + e). 

Pj 

If this holds, then the distance between p and q constrained to A'^, is at most This lemma 
follows from the following result. 

Lemma 12. Let C > 2. For a unimodal distribution over [n], there are at most intervals 

Ij that satisfy -Ar < {1 + e/C). 

Proof. To the contrary, if there are more than intervals, then at least half of them are 

on one side of the mode, however this implies that the ratio of the largest probability and smallest 
probability is at least (1 + e/Cfi, and if j > is at least 50n/e, contradicting that we 

have removed all such elements. □ 

We have one additional pre-processing step here. We compute $q(A^c)$ and if it is smaller than
$1 - \varepsilon/25$, we output Reject.

Suppose there are L' intervals in $A^c$. Then, except for at most one of these L' intervals, we know that the
$\chi^2$ distance between p and q is at most $O(\varepsilon^2)$ when p is unimodal, and the TV distance between p
and q is at least $\varepsilon/2$ over $A^c$. We propose the following simple modification to take into account
the one interval that might introduce a high $\chi^2$ distance in spite of having a small total variation.
If we knew the interval, we could simply remove it and proceed. Since we do not know where the
interval lies, we do the following.

1. Let $Z_j$ be the $\chi^2$ statistic over the jth interval in $A^c$, computed with $O(\sqrt{n}/\varepsilon^2)$ samples.

2. Let $Z_\ell$ be the largest among all $Z_j$'s.

3. If $\sum_{j \ne \ell} Z_j > m\varepsilon^2/10$, output Reject.

4. Output Accept.
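The four steps above amount to a thresholded sum with the largest term dropped. A minimal sketch, assuming the per-interval statistics have already been computed; the function name and the example values are hypothetical:

```python
def unimodality_decision(Z, m, eps):
    """Drop the largest per-interval statistic, then compare the rest to m*eps^2/10.

    Z is the list of per-interval chi-squared statistics Z_j, m is the number
    of samples used to compute them.
    """
    if not Z:
        return "Accept"
    total_without_largest = sum(Z) - max(Z)
    return "Reject" if total_without_largest > m * eps ** 2 / 10 else "Accept"

# Hypothetical statistics: one spurious interval with a very large value.
decision_small = unimodality_decision([0.2, 0.3, 100.0], m=1000, eps=0.1)
decision_large = unimodality_decision([1.0, 2.0, 100.0], m=1000, eps=0.1)
```

In the first call the single large statistic is discarded and the remainder stays below the threshold; in the second the remaining statistics alone already exceed it.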

The objective of removing the largest $\chi^2$ statistic is our substitute for not knowing the largest
interval. We now prove the correctness of this algorithm.

Case 1: $p \in \mathcal{UM}_n$. We only concentrate on the final step. The $\chi^2$ statistics over all but one interval
are at most $c \cdot m\varepsilon^2$, and the variance is bounded as before. Since we remove the largest statistic,
the expected value of the new statistic is strictly dominated by that of these intervals. Therefore,
the algorithm outputs Accept with at least the same probability as if we removed the spurious
interval.

Case 2: $p \notin \mathcal{UM}_n$. This is the hard case to prove for unimodal distributions. We know that the
$\chi^2$ statistic is large in this case, and we therefore have to prove that it remains large even after
removing the largest test statistic $Z_\ell$.

We invoke Kolmogorov's Maximal Inequality to this end.

Lemma 13 (Kolmogorov's Maximal Inequality). For independent zero mean random variables
$X_1, \ldots, X_L$ with finite variance, let $S_\ell = X_1 + \ldots + X_\ell$. Then for any $\lambda > 0$,

$$\Pr\left(\max_{1 \le \ell \le L} |S_\ell| \ge \lambda\right) \le \frac{1}{\lambda^2} \cdot \mathrm{Var}(S_L). \tag{16}$$

As a corollary, it follows that $\Pr\left(\max_\ell |X_\ell| \ge 2\lambda\right) \le \frac{1}{\lambda^2} \cdot \mathrm{Var}(S_L)$.

In the case we are interested in, we let Xi = — E \Z(\. Then, similar to the computations 

before, and the fact that each interval has a small mass, it follows that that the variance of the 
summation is at most E [Zfi^ /lOO. Taking A = E [S'/, — me^/3] /lOO, it follows that the statistic 
does not fall below to ^/n. This completes the proof of Theorem [5l 


E Learning product distributions in $\chi^2$ distance

In this section we prove Lemma 7. The proof is analogous to the proof for learning monotone
distributions, and hinges on the following result of [KOPS15]. Given m samples from a distribution
p over n elements, the add-1 estimator (Laplace estimator) q satisfies:

$$\mathbb{E}[\chi^2(p,q)] \le \frac{n-1}{m+1}.$$

Now, suppose p is a product distribution over $\mathcal{X} = [n_1] \times \cdots \times [n_d]$. We simply perform the
add-1 estimation over each coordinate independently, giving a distribution $q^1 \times \cdots \times q^d$. Since
p is a product distribution, the estimates in each coordinate are independent. Therefore, a simple
application of the previous result and independence of the coordinates implies

\begin{align}
\mathbb{E}[\chi^2(p,q)] &= \prod_{\ell=1}^{d} \left(1 + \mathbb{E}[\chi^2(p^\ell, q^\ell)]\right) - 1 \nonumber \\
&\le \prod_{\ell=1}^{d} \left(1 + \frac{n_\ell}{m+1}\right) - 1 \nonumber \\
&\le \exp\left(\frac{\sum_{\ell} n_\ell}{m+1}\right) - 1, \tag{17}
\end{align}

where (17) follows from $e^x \ge 1 + x$. Using $e^x \le 1 + 2x$ for $0 \le x \le 1$, we have

$$\mathbb{E}[\chi^2(p,q)] \le \frac{2\sum_{\ell} n_\ell}{m+1}, \tag{18}$$

when $m \ge \sum_{\ell} n_\ell$. Therefore, following an application of Markov's inequality, when $m = \Omega\left((\sum_{\ell} n_\ell)/\varepsilon^2\right)$,
Lemma 7 is proved.
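The learner described in this section applies the add-1 estimator to each coordinate separately and returns the product of the estimated marginals. The sketch below illustrates this; the function names, the 0-based coordinate values, and the sample data are assumptions made for the example.

```python
def add_one_marginal(samples, coord, n_l):
    """Laplace (add-1) estimate of one coordinate's marginal over {0, ..., n_l - 1}."""
    m = len(samples)
    counts = [0] * n_l
    for x in samples:
        counts[x[coord]] += 1
    return [(c + 1) / (m + n_l) for c in counts]

def product_add_one(samples, dims):
    """Estimated marginals q^1, ..., q^d; the learned distribution is their product."""
    return [add_one_marginal(samples, l, n_l) for l, n_l in enumerate(dims)]

# Hypothetical samples from a distribution over {0,1} x {0,1,2}.
q_marginals = product_add_one([(0, 1), (0, 2), (1, 1)], dims=(2, 3))
```

Only $\sum_\ell n_\ell$ parameters are estimated rather than $\prod_\ell n_\ell$, which is the source of the additive $\sum_\ell n_\ell/\varepsilon^2$ term in the sample complexity.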

F Details on Testing Log-Concavity

It will suffice to prove Lemma 8.

Proof of Lemma 8. We first draw samples from p and obtain a $O(1/\varepsilon^{3/2})$-piecewise constant
distribution f by appropriately flattening the empirical distribution. The proof is now in two parts. In
the first part, we show that if $p \in \mathcal{LCD}_n$ then f will be close to p in $\chi^2$ distance over its effective
support. The second part involves proper learning of p. We will use a linear program on f to find
a distribution $q \in \mathcal{LCD}_n$. This distribution is such that if $p \in \mathcal{LCD}_n$, then $\chi^2(p,q)$ is small, and
otherwise the algorithm will either output some $q \in \mathcal{LCD}_n$ (with no other relevant guarantees) or
Reject.

We first construct /. Let p be the empirical distribution obtained by sampling 0(l/e^) samples 
from p. By Lemma [H with probability at least 5/6, dK{p,p) < e^^^/lO. In particular, note that 
\Pi — Pi\ < Condition on this event in the remainder of the proof. 

Let a be the minimum i such that p, > e^/^/5, and let b be the maximum i satisfying the same 
condition. Let M = {a, ■ ■ ■ ,b} or 0 if a and b are undefined. By the guarantee provided by the 
DKW inequality, pi > e^/^/10 for all i G M. Furthermore, pi £ piP: e^/^/10 G (1 ± e) • pi. For each 
i G M, let fi = Pi- We note that \M\ = 0(1/s), so this contributes 0{l/s) constant pieces to /. 

We now divide the rest of the domain into t intervals, all but constantly many of measure 
©(e^/^) (under p). This is done via the following iterative procedure. As a base case, set tq = 0. 
Dehne Ij as [lj,rj], where Ij = rj_i -|-1 and rj is the largest j G [n] such that p{Ij) < 9e^/^/10. The 
exception is if Ij would intersect M - in this case, we “skip” M: set rj = a—l and Ijj^i = b + 1. 
If such a j exists, denote it by j*. We note that p{Ij) < p{Ij) + jit) < Furthermore, for 
all j except j* and t, Vj + l^ M, so p{Ij) > 9e^/^/10 — e^/^/5 — > 3e^/^/5. Observe that 

this lower bound implies that t < for e sufficiently small. 

Part 1. For this part of the algorithm, we only care about the guarantees when p G CCPn, so we 
assume this is the case. 

For the domain [n] \ M, we let / be the flattening of p over the intervals Ji,... R. To analyze 
/, we need a structural property of log-concave distributions due to Chan, Diakonikolas, Servedio, 
and Sun [CDSS13] . This essentially states that a log-concave distribution cannot have a sudden 
increase in probability. 

Lemma 14 (Lemma 4.1 in [CDSS13]). Let p be a distribution over [n] that is non-decreasing and
log-concave on $[1, x] \subseteq [n]$. Let $I = [x, y]$ be an interval of mass $p(I) = \tau$, and suppose that the
interval $J = [1, x-1]$ has mass $p(J) = \sigma > 0$. Then

$$p(y)/p(x) \le 1 + \tau/\sigma.$$

Recall that any log-concave distribution is unimodal, and suppose the mode of p is at i^. We 
will first focus on the intervals R,... ,1^^ which lie entirely to the left of R and M. We will refer 
to Ij as Lj for all j <tL- Note that p is non-decreasing over these intervals. 

The next steps to the analysis are as follows. First we show that the flattening of p over Lj is a 
multiplicative (1 -|- 0(l/j)) estimate for each pi G Lj. Then, we show that flattening the empirical 
distribution p over Lj is a multiplicative (1 -|- 0{1/j)) estimate of p(z) for each i G Lj. Finally, we 
exclude a small number of intervals (those corresponding to 0(e) mass at the left and right side of 
the domain, as well as j*) in order to get the approximation we desire on an effective support. 

• First, recall that p{Lj) < for all j. Also, letting Jj = [l,rj_i], we have that p{Jj) > 
{j — 1) • 3e^/^/5. Thus by Lemma EH p{rj) < p(/j)(l -|- 2/{j — 1)). Since the distribution is 
non-decreasing in Lj, the flattening p of p is such that p(i) G p(f)(l ± J^) i £ Lj. 
• We have that $p(L_j) \ge 3\varepsilon^{3/2}/5$, and $\hat{p}(L_j) \in p(L_j) \pm \varepsilon^{5/2}/10$, so $\hat{p}(L_j) \in p(L_j) \cdot (1 \pm \frac{\varepsilon}{6})$, and
hence $f(i) \in \bar{p}(i) \cdot (1 \pm \frac{\varepsilon}{6})$ for all $i \in L_j$. Combining with the previous point, we have that


p{i) G p{i) • ( 1 ± 


2e ^ e , 2 

3(j - 1) + 6 ^ 


Gp(i) 



- — - 1 • 


A symmetric statement holds for the intervals that lie entirely to the right of io and M. We 
will refer to Ij as Rt-j for all i > tL- 

To summarize, we have the following guarantees for the distribution /: 

• For all i G M, f{i) G p{i) ■ (1 zb e); 

• For all i G Lj (except Li and Lj*), f{i) G p{i) ■ fl zb lyV 


• For all i G Rj (except Ri), f{i) G p{i) ■ zb ; 


Note that, in particular, we have multiplicative estimates for all intervals, except those in Li, Lj*, 
Ri and the interval containing zq. Let S be the set of all intervals except Lj*, Lj and Rj for 
3 < ciiid the one containing zq Then, since each interval has probability mass at most 0(e^/^) 

and we are excluding 0(l/\/e) intervals, p{S) > 1 — 0{s). 

We now compute the x^’distance induced by this approximation for elements in S. For an 
element z G Lj n 5, we have 

(/(z) -p{i))^ ^ 60p{i) 
p{i) - p 


Summing over all i £ Lj f] S gives 

since the probability mass of Lj is at most . Summing this over all Lj for j > 1 /*/£ and j ^ j* 
gives 




2 / 6 - 


. 3/2 


1 





—^dx 


= 60 ^ 3/2 (^) 
= 0(6^) 


as desired. 


Part 2. To obtain a distribution $q \in \mathcal{LCD}_n$, we write a linear program. We will work in the log
domain, so our variables will be $Q_i$, representing $\log q(i)$ for $i \in [n]$. We will use $F_i = \log f(i)$ as
parameters in our LP. There will be no objective function; we simply search for a feasible point.
Our constraints will be

$$Q_{i-1} + Q_{i+1} \le 2Q_i \quad \forall i \in [2, n-1]$$

$$Q_i \le 0 \quad \forall i \in [n]$$

$$\log(1 - \varepsilon) \le Q_i - F_i \le \log(1 + \varepsilon) \quad \text{for } i \in M$$
fog ( 1 - ^ ) <\Qi- Fi\ <


log (^1 + for i e Fj,j > 1/Ve and j / f 
log


< \Qi - Fi\ < log for i E Rj,j >l/Ve 

If we run the linear program, then after a rescaling, summing the error over all the intervals
in the LP gives us that the distance between p and q is $O(\varepsilon^2)$ in $\chi^2$-distance on a set S which has
measure $p(S) \ge 1 - 4\varepsilon$, as desired.

If the linear program finds a feasible point, then we obtain a $q \in \mathcal{LCD}_n$. Furthermore, if
$p \in \mathcal{LCD}_n$, this also tells us that (after a rescaling of $\varepsilon$), summing the error over all intervals implies
that $\chi^2(p_S, q_S) \le \frac{\varepsilon^2}{500}$ for a known set S with $p(S) \ge 1 - O(\varepsilon)$, as desired. If $M \ne \emptyset$, this algorithm
works as described. The issue is if $M = \emptyset$, then we don't know when the L intervals end and the R
intervals begin. In this case, we run $O(1/\varepsilon)$ LPs, using each interval as the one containing $i_0$, and
thus acting as the barrier between the L intervals (to its left) and the R intervals (to its right). If
p truly was log-concave, then one of these guesses will be correct and the corresponding LP will
find a feasible point. □
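The log-domain constraints of the linear program can be illustrated with a feasibility checker for a candidate q. This sketch does not solve the LP. It only tests the log-concavity constraint $Q_{i-1} + Q_{i+1} \le 2Q_i$, the constraint $Q_i \le 0$, and a multiplicative $(1 \pm \varepsilon)$ closeness constraint on M, where the exact form of the last constraint is our reading of the text. The function name, the tolerance, and the example vectors are hypothetical.

```python
import math

def log_domain_feasible(q, f, eps, M, tol=1e-9):
    """Check a strictly positive candidate q against the log-domain constraints.

    q and f are lists of positive masses, M is a list of 0-based indices on which
    q must be within a (1 +/- eps) factor of f.
    """
    Q = [math.log(x) for x in q]
    F = [math.log(x) for x in f]
    # Log-concavity: Q_{i-1} + Q_{i+1} <= 2 Q_i at every interior point.
    concave = all(Q[i - 1] + Q[i + 1] <= 2 * Q[i] + tol for i in range(1, len(Q) - 1))
    # Each mass is at most 1, i.e. Q_i <= 0.
    nonpositive = all(x <= tol for x in Q)
    # Multiplicative closeness to f on M (assumed form of the constraint).
    close_on_M = all(
        math.log(1 - eps) - tol <= Q[i] - F[i] <= math.log(1 + eps) + tol for i in M
    )
    return concave and nonpositive and close_on_M

# Hypothetical candidates: the first is log-concave, the second has a dip in the middle.
ok = log_domain_feasible([0.1, 0.2, 0.4, 0.2, 0.1], [0.1, 0.2, 0.4, 0.2, 0.1], 0.1, [2])
bad = log_domain_feasible([0.4, 0.1, 0.5], [0.4, 0.1, 0.5], 0.1, [])
```

Working with $Q_i = \log q(i)$ is what turns the multiplicative condition $q(i-1)\,q(i+1) \le q(i)^2$ into a linear one.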

G Details on MHR testing

Proof of Lemma 9. As with log-concave distributions, our method for MHR distributions can be
split into two parts. In the first step, if $p \in \mathcal{MHR}_n$, we obtain a distribution q which is $O(\varepsilon^2)$-close
to p in $\chi^2$ distance on a set A of intervals such that $p(A) \ge 1 - O(\varepsilon)$. q will achieve this by being a
multiplicative $(1 + O(\varepsilon))$ approximation for each element within these intervals. This step is very
similar to the decomposition used for unimodal distributions (described in Section D), so we sketch
the argument and highlight the key differences.

The second step will be to find a feasible point in a linear program. If $p \in \mathcal{MHR}_n$, there
should always be a feasible point, indicating that q is close to a distribution in $\mathcal{MHR}_n$ (leveraging
the particular guarantees for our algorithm for generating q). If $d_{\mathrm{TV}}(p, \mathcal{MHR}_n) \ge \varepsilon$, there may
or may not be a feasible point, but when there is, it should imply the existence of a distribution
$p^* \in \mathcal{MHR}_n$ such that $d_{\mathrm{TV}}(q, p^*) \le \varepsilon/2$.

The analysis will rely on the following lemma from [CDSS13], which roughly states that an
MHR distribution is "almost" non-decreasing.

Lemma 15 (Lemma 5.1 in [CDSS13]). Let p be an MHR distribution over [n]. Let $I = [a, b] \subseteq [n]$
be an interval, and $R = [b+1, n]$ be the elements to the right of I. Let $\eta = p(I)/p(R)$. Then
$p(b+1) \ge \frac{1}{1+\eta}\, p(a)$.

Part 1. As before, with unimodal distributions, we start by taking O(^^f^) samples, with the 
goal of partitioning the domain into intervals of mass approximately $O(1/b)$. First, we will ignore
the left and rightmost intervals of mass $O(\varepsilon)$. For all "heavy" elements with mass $\ge O(1/b)$, we
consider them as singletons. We note that Lemma 15 implies that there will be at most $O(1/\varepsilon)$
contiguous intervals of such elements. The rest of the domain is greedily divided (from left to right)
into intervals of mass $O(1/b)$, cutting an interval short if we reach one of the heavy elements. This
will result in the guarantee that all but potentially $O(1/\varepsilon)$ intervals have $O(1/b)$ mass.

Next, similar to unimodal distributions, considering the flattened distribution, we discard all
intervals for which the per-element probability is not within a $(1 \pm O(\varepsilon))$ multiplicative factor of
the same value for both neighboring intervals. The claim is that all remaining intervals will have
the property that the per-element probability is within a $(1 \pm O(\varepsilon))$ multiplicative factor of the true
probability. This is implied by Lemma 15. If there were a point in an interval which was above
this range, the distribution must decrease slowly, and the next interval would have a much larger
per-element weight, thus leading to the removal of this interval. A similar argument forbids us from
missing an interval which contains a point that lies outside this range. Relying on the fact that
truncating the left and rightmost intervals eliminates elements with low probability mass, similar
to the unimodal case, one can show that we will remove at most $\log(n/\varepsilon)/\varepsilon$ intervals, and thus a
$\log(n/\varepsilon)/(b\varepsilon)$ probability mass. Choosing $b = \Omega(\log(n/\varepsilon)/\varepsilon^2)$ limits this to be $O(\varepsilon)$, as desired. At
this point, if p is indeed MHR, the multiplicative estimates guarantee that the result is $O(\varepsilon^2)$-close
in $\chi^2$-distance among the remaining intervals.

Part 2. We note that an equivalent condition for a distribution $f$ being MHR is concavity of
$\log(1 - F)$, where $F$ is the CDF of $f$. Therefore, our approach for this part will be similar to the
approach used for log-concave distributions.

Given the output distribution $q$ from the previous part of this algorithm, our goal will be to check
if there exists an MHR distribution $f$ which is $O(\varepsilon)$-close to $q$. We will run a linear program with
variables $\mathfrak{f}_i = \log(1 - F_i)$ (written $\mathfrak{f}_i$ to distinguish them from the probabilities $f_i$). First, we
ensure that $f$ is a distribution. This can be done with the following constraints:

$$
\begin{aligned}
\mathfrak{f}_i &\le 0 && \forall i \in [n] \\
\mathfrak{f}_i &\ge \mathfrak{f}_{i+1} && \forall i \in [n-1] \\
\mathfrak{f}_n &= -\infty
\end{aligned}
$$

To ensure that $f$ is MHR, we use the following constraint:

$$\mathfrak{f}_{i-1} + \mathfrak{f}_{i+1} \le 2\mathfrak{f}_i \qquad \forall i \in [2, n-1]$$
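To make the constraints concrete, the sketch below checks them for a given candidate pmf instead of solving the linear program. It is only an illustration under stated assumptions: the names `log_survival` and `satisfies_constraints`, the tolerance `tol`, and the two example distributions are hypothetical, and the concavity constraint is checked over exactly the index range stated above.

```python
import math
from itertools import accumulate


def log_survival(f):
    """Return [log(1 - F_1), ..., log(1 - F_n)] for a pmf f over [n]."""
    tails = list(accumulate(reversed(f)))[::-1]  # tails[i] = f[i] + ... + f[n-1]
    survival = tails[1:] + [0.0]                 # mass strictly to the right of each point
    return [math.log(s) if s > 0 else -math.inf for s in survival]


def satisfies_constraints(f, tol=1e-9):
    """Check the distribution constraints and the concavity (MHR) constraint."""
    ell = log_survival(f)
    n = len(ell)
    is_distribution = (
        all(x <= 0 for x in ell)
        and all(ell[i] >= ell[i + 1] - tol for i in range(n - 1))
        and ell[-1] == -math.inf
    )
    # 1-indexed i in [2, n-1] corresponds to 0-indexed i in [1, n-2].
    is_concave = all(
        ell[i - 1] + ell[i + 1] <= 2 * ell[i] + tol for i in range(1, n - 1)
    )
    return is_distribution and is_concave


def normalized(weights):
    total = sum(weights)
    return [w / total for w in weights]


# Illustrative inputs (assumptions): truncated geometric vs. a 1/k^2 tail.
mhr_example = normalized([0.7 ** i for i in range(30)])
non_mhr_example = normalized([1 / (i + 1) ** 2 for i in range(30)])
```

The truncated geometric distribution passes, while the heavy-tailed example fails the concavity constraint near the start of the domain.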

Now, ideally, we would like to ensure $f$ and $q$ are $\varepsilon$-close in total variation distance by ensuring
they are pointwise within a multiplicative $(1 \pm \varepsilon)$ factor of each other:

$$(1 - \varepsilon) \le f_i / q_i \le (1 + \varepsilon)$$

We note that this is a stronger condition than $f$ and $q$ being $\varepsilon$-close, but if $p \in \mathcal{MHR}_n$, the
guarantees of the previous step would imply the existence of such an $f$.

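We have a separate treatment for the identified singletons (i.e., those with probability $\ge 1/b$)
and the remainder of the support. For each element $q_i$ identified to have $\ge 1/b$ mass, we add two
constraints on the LP variables $\mathfrak{f}_i = \log(1 - F_i)$, where $Q$ denotes the CDF of $q$:

$$
\begin{aligned}
\log((1 - \varepsilon/2b)(1 - Q_i)) &\le \mathfrak{f}_i \le \log((1 + \varepsilon/2b)(1 - Q_i)) \\
\log((1 - \varepsilon/2b)(1 - Q_{i-1})) &\le \mathfrak{f}_{i-1} \le \log((1 + \varepsilon/2b)(1 - Q_{i-1}))
\end{aligned}
$$

If we satisfy these constraints, it implies that

$$q_i - \varepsilon/b \le f_i \le q_i + \varepsilon/b.$$

Since $q_i \ge 1/b$, this implies

$$(1 - \varepsilon) q_i \le f_i \le (1 + \varepsilon) q_i$$

as desired.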
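The step from the logarithmic constraints to the additive bound can be spelled out as follows. This is a sketch of the omitted calculation, assuming that $Q_i$ denotes the CDF of $q$ (so that $q_i = (1 - Q_{i-1}) - (1 - Q_i)$) and that the LP variables equal $\log(1 - F_i)$ and $\log(1 - F_{i-1})$:

```latex
\begin{align*}
f_i &= (1 - F_{i-1}) - (1 - F_i) \\
    &\le \left(1 + \tfrac{\varepsilon}{2b}\right)(1 - Q_{i-1})
       - \left(1 - \tfrac{\varepsilon}{2b}\right)(1 - Q_i) \\
    &= q_i + \tfrac{\varepsilon}{2b}\bigl[(1 - Q_{i-1}) + (1 - Q_i)\bigr] \\
    &\le q_i + \tfrac{\varepsilon}{b}.
\end{align*}
```

The lower bound $f_i \ge q_i - \varepsilon/b$ follows symmetrically, and since $q_i \ge 1/b$ we have $\varepsilon/b \le \varepsilon q_i$, which gives the multiplicative form.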

Now, the remaining elements each have $< 1/b$ mass. For each such element $q_i$, we create a
constraint on the LP variables $\mathfrak{f}_i = \log(1 - F_i)$:

$$(1 - O(\varepsilon))\, \frac{q_i}{1 - Q_{i-1}} \le \mathfrak{f}_{i-1} - \mathfrak{f}_i \le (1 + O(\varepsilon))\, \frac{q_i}{1 - Q_{i-1}}$$

Note that the middle term is

$$-\log\left(\frac{1 - F_i}{1 - F_{i-1}}\right) = -\log\left(1 - \frac{f_i}{1 - F_{i-1}}\right) \in \frac{f_i}{1 - F_{i-1}}\,(1 \pm 2\varepsilon),$$

where the last step uses the Taylor expansion and the facts that $f_i < 1/b$ and $1 - F_{i-1} \ge \varepsilon$
(since during the previous part, we ignored the rightmost $O(\varepsilon)$ probability mass). If we satisfy the
desired constraints, it implies that

$$f_i \in \frac{1}{(1 \pm 2\varepsilon)} \cdot \frac{1 - F_{i-1}}{1 - Q_{i-1}}\,(1 \pm O(\varepsilon))\, q_i.$$


Since we are taking n(l/e^) samples and 1 — Tj_i > n(e). Lemma [T] implies that fi is indeed a 
multiplicative (1 ± e) approximation for these points as well. 
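The Taylor-expansion step used above for the non-singleton elements can be checked numerically. The sketch below assumes that the ratio $x = f_i/(1 - F_{i-1})$ is at most $\varepsilon$, which would follow from $f_i < 1/b$ and $1 - F_{i-1} \ge \varepsilon$ whenever $b \ge 1/\varepsilon^2$; the function name `within_taylor_band` is hypothetical.

```python
import math


def within_taylor_band(x, eps):
    """True if -log(1 - x) lies in the interval x * (1 +/- 2 * eps)."""
    value = -math.log1p(-x)  # -log(1 - x), computed stably for small x
    return x * (1 - 2 * eps) <= value <= x * (1 + 2 * eps)


# Check the band on a small grid of (eps, x) pairs with 0 < x <= eps.
grid_ok = all(
    within_taylor_band(eps * k / 10, eps)
    for eps in (0.01, 0.1, 0.25)
    for k in range(1, 11)
)
```

The check relies on $x \le -\log(1 - x) \le x(1 + x)$ for $0 < x \le 1/2$, so it stops holding once $x$ is much larger than $\varepsilon$ (for instance $x = 0.9$ with $\varepsilon = 0.1$).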

We note that all points which do not fall into these two cases make up a total of $O(\varepsilon)$ probability
mass. Therefore, $f$ may be arbitrary at these points and only incur $O(\varepsilon)$ cost in total variation
distance.

If we find a feasible point for this linear program, it implies the existence of an MHR distribution
within $O(\varepsilon)$ total variation distance. In this case, we continue to the testing portion of the algorithm.
Furthermore, if $p \in \mathcal{MHR}_n$, our method for generating $q$ certifies that such a distribution exists,
and we continue on to the testing portion of the algorithm. □


H Details of the Lower Bounds 

In this section, for the class of distributions $\mathcal{Q}$ described in the discussion on lower bounds and a class
of interest $\mathcal{C}$, we show that $d_{TV}(\mathcal{C}, \mathcal{Q}) \ge \varepsilon$, thus implying a lower bound of $\Omega(\sqrt{n}/\varepsilon^2)$ for testing $\mathcal{C}$.

H.1 Monotone distributions

We first consider $d = 1$ and prove that for appropriately chosen $c$, any monotone distribution over
$[n]$ is $\varepsilon$-far from all distributions in $\mathcal{Q}$. Consider any $q \in \mathcal{Q}$. For this distribution, we say that $i \in [n]$
is a raise-point if $q_i < q_{i+1}$. Let $R_q$ be the set of raise-points of $q$. For $q \in \mathcal{Q}$, (16) implies that at least one
in every four consecutive integers in $[n]$ is a raise-point, and therefore $|R_q| \ge n/4$. Moreover, note
that if $i$ is a raise-point, then $i + 1$ is not a raise-point. For any monotone (decreasing) distribution
$p$, $p_i \ge p_{i+1}$. For any raise-point $i \in R_q$, by the triangle inequality,

$$|p_i - q_i| + |p_{i+1} - q_{i+1}| \ge |p_i - p_{i+1} + q_{i+1} - q_i| \ge q_{i+1} - q_i = \frac{2c\varepsilon}{n}. \tag{19}$$

Summing over the set $R_q$, we obtain $d_{TV}(p, q) \ge \frac{1}{2} |R_q| \cdot \frac{2c\varepsilon}{n} \ge c\varepsilon/4$. Therefore, if $c \ge 4$, then
$d_{TV}(\mathcal{M}_n, q) \ge \varepsilon$. This proves the lower bound for $d = 1$.
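The raise-point argument can be illustrated numerically. The sketch below assumes that $\mathcal{Q}$ is the Paninski-type class in which consecutive pairs of elements receive probabilities $(1 + zc\varepsilon)/n$ and $(1 - zc\varepsilon)/n$ for a random sign $z$ (mirroring the construction described for $[n]^d$); the function names, the seed, and the parameter values are illustrative assumptions.

```python
import random


def paninski_instance(n, c, eps, seed=0):
    """Sample one member of the assumed class: pairs (1 + z*c*eps)/n, (1 - z*c*eps)/n."""
    rng = random.Random(seed)
    q = []
    for _ in range(n // 2):
        z = rng.choice((-1, 1))
        q.extend([(1 + z * c * eps) / n, (1 - z * c * eps) / n])
    return q


def raise_points(q):
    """Indices i (0-indexed) with q[i] < q[i + 1]."""
    return [i for i in range(len(q) - 1) if q[i] < q[i + 1]]


def tv_distance(p, q):
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def tv_lower_bound(q, c, eps):
    """Bound obtained by summing (19) over the raise-points: (1/2) |R_q| (2 c eps / n)."""
    return 0.5 * len(raise_points(q)) * (2 * c * eps / len(q))


n, c, eps = 1000, 4, 0.05
q = paninski_instance(n, c, eps)
uniform = [1 / n] * n
# A strictly decreasing (hence monotone) distribution, used only as an example.
decreasing = [2 * (n - i) / (n * (n + 1)) for i in range(n)]
```

With these parameters, at least $n/4$ raise-points are found, and both monotone examples are at total variation distance at least `tv_lower_bound(q, c, eps)`, which is itself at least $c\varepsilon/4 = \varepsilon$.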

This argument can be extended to $[n]^d$. Consider the following class of distributions on $[n]^d$.
For each point $i = (i_1, \ldots, i_d) \in [n]^d$, where $i_1$ is even, generate a random $z \in \{-1, 1\}$, and assign
to $i$ a probability of $(1 + zc\varepsilon)/n^d$. Let $e_1 = (1, 0, \ldots, 0)$. Similar to $d = 1$, assign a probability
$(1 - zc\varepsilon)/n^d$ to the point $i + e_1 = (i_1 + 1, i_2, \ldots, i_d)$. This class consists of $2^{n^d/2}$ distributions,
and Paninski’s arguments extend to give a lower bound of $\Omega(n^{d/2}/\varepsilon^2)$ samples to distinguish this
class from the uniform distribution over $[n]^d$. It remains to show that all these distributions are $\varepsilon$-far
from $\mathcal{M}_n^d$. Call a point $i$ a raise-point if $p_i < p_{i+e_1}$. For any $i$, one of the points $i$, $i + e_1$,
$i + 2e_1$, $i + 3e_1$ is a raise-point, and the number of raise-points is at least $n^d/4$. Invoking the triangle
inequality (identical to (19)) over the raise-points in the first dimension shows that any monotone
distribution over $[n]^d$ is at a distance at least $\frac{c\varepsilon}{4}$ from any distribution in this class. Choosing $c = 4$ yields
a bound of $\varepsilon$.

H.2 Testing Product Distributions

Our idea for testing independence is similar to the previous section. We sketch the construction of
a class of distributions on $\mathcal{X} = [n_1] \times \cdots \times [n_d]$. Then $|\mathcal{X}| = n_1 \cdot n_2 \cdots n_d$. For each element in $\mathcal{X}$
assign a value $(1 \pm c\varepsilon)$, and then for each such assignment, normalize the values so that they add to 1,
giving rise to a distribution. This gives us a class of distributions. The key argument is to show
that a large fraction of these distributions are far from being a product distribution. This follows
since the number of degrees of freedom of a product distribution is exponentially smaller than the number of
possible distributions. The second step is to simply apply Paninski’s argument, now over the larger
set of distributions, where we show that distinguishing the collection of distributions we constructed
from the uniform distribution over $\mathcal{X}$ (which is a product distribution) requires $\sqrt{|\mathcal{X}|}/\varepsilon^2$ samples.


H.3 Log-concave and Unimodal distributions 

We will show that any log-concave or unimodal distribution is $\varepsilon$-far from all distributions in $\mathcal{Q}$.
Since $\mathcal{LCV}_n \subset \mathcal{U}_n$, it will suffice to show this for every unimodal distribution. Consider any
unimodal distribution $p$, with mode $\ell$. Then, $p$ is monotone non-decreasing over the interval $[\ell]$ and
non-increasing over $\{\ell + 1, \ldots, n\}$. By the argument for monotone distributions, the total variation
distance between $p$ and any distribution $q$ over elements greater than $\ell$ is at least $\frac{n - \ell}{n} \cdot \frac{c\varepsilon}{4}$, and
over elements less than $\ell$ is at least $\frac{\ell}{n} \cdot \frac{c\varepsilon}{4}$. Summing these two gives the desired bound.


H.4 Monotone Hazard distributions 

We will show that any monotone hazard rate distribution is e-far from all distributions in Q. 

Let $p$ be any monotone-hazard distribution. Any distribution $q \in \mathcal{Q}$ has mass at least $1/2$ over

the interval I = [re/4,3n/4]. Therefore, by Lemma fTHl for any i G I, ^1 -|- > pt. As noted 

before, at least $n/8$ of the raise-points are in $I$.

For any $i \in I \cap R_q$, $q_i = (1 + c\varepsilon)/n$ and $q_{i+1} = (1 - c\varepsilon)/n$. Let

$$d_i = |p_i - q_i| + |p_{i+1} - q_{i+1}|. \tag{20}$$


If $p_i \ge (1 + 2c\varepsilon)/n$ or $p_i \le 1/n$, then the first term, and therefore $d_i$, is at least $c\varepsilon/n$. If
$p_i \in (1/n, (1 + 2c\varepsilon)/n)$, then for $n > 5/(c\varepsilon)$

1 1 1 - ce/2 

Pi+i >-r > 


re 1 + ^ ~ 


re 


Therefore the second term of $d_i$ is at least $c\varepsilon/(2n)$. Since there are at least $n/8$ raise-points in $I$,


In ce ce 

a 2 8 to - l6' 

Thus any MHR distribution is $\varepsilon$-far from $\mathcal{Q}$ for $c \ge 16$.


( 21 ) 